Picture decoding method

ABSTRACT

The invention relates to a method for buffering encoded pictures. The method includes an encoding step for forming encoded pictures in an encoder. The method also includes a transmission step for transmitting said encoded pictures to a decoder as transmission units, a buffering step for buffering transmission units transmitted to the decoder in a buffer, and a decoding step for decoding the encoded pictures for forming decoded pictures. The buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 USC §119 to U.S. Provisional Patent Application No. 60/544,598 filed on Feb. 13, 2004.

FIELD OF THE INVENTION

The present invention relates to a method for buffering encoded pictures, the method including an encoding step for forming encoded pictures in an encoder, a transmission step for transmitting said encoded pictures to a decoder, a decoding step for decoding the encoded pictures for forming decoded pictures, and a rearranging step for arranging the decoded pictures in decoding order. The invention also relates to a system, a transmitting device, a receiving device, an encoder, a decoder, an electronic device, a software program, and a storage medium.

BACKGROUND OF THE INVENTION

Published video coding standards include ITU-T H.261, ITU-T H.263, ISO/IEC MPEG-1, ISO/IEC MPEG-2, and ISO/IEC MPEG-4 Part 2. These standards are herein referred to as conventional video coding standards.

Video Communication Systems

Video communication systems can be divided into conversational and non-conversational systems. Conversational systems include video conferencing and video telephony. Examples of such systems include ITU-T Recommendations H.320, H.323, and H.324, which specify a video conferencing/telephony system operating in ISDN, IP, and PSTN networks, respectively. Conversational systems are characterized by the intent to minimize the end-to-end delay (from audio-video capture to the far-end audio-video presentation) in order to improve the user experience.

Non-conversational systems include playback of stored content, such as Digital Versatile Disks (DVDs) or video files stored in a mass memory of a playback device, digital TV, and streaming. A short review of the most important standards in these technology areas is given below.

A dominant standard in digital video consumer electronics today is MPEG-2, which includes specifications for video compression, audio compression, storage, and transport. The storage and transport of coded video is based on the concept of an elementary stream. An elementary stream consists of coded data from a single source (e.g. video) plus ancillary data needed for synchronization, identification and characterization of the source information. An elementary stream is packetized into either constant-length or variable-length packets to form a Packetized Elementary Stream (PES). Each PES packet consists of a header followed by stream data called the payload. PES packets from various elementary streams are combined to form either a Program Stream (PS) or a Transport Stream (TS). PS is aimed at applications having negligible transmission errors, such as store-and-play type of applications. TS is aimed at applications that are susceptible to transmission errors. However, TS assumes that the network throughput is guaranteed to be constant.

There is a standardization effort going on in a Joint Video Team (JVT) of ITU-T and ISO/IEC. The work of JVT is based on an earlier standardization project in ITU-T called H.26L. The goal of the JVT standardization is to release the same standard text as ITU-T Recommendation H.264 and ISO/IEC International Standard 14496-10 (MPEG-4 Part 10). The draft standard is referred to as the JVT coding standard in this paper, and the codec according to the draft standard is referred to as the JVT codec.

The codec specification itself distinguishes conceptually between a video coding layer (VCL) and a network abstraction layer (NAL). The VCL contains the signal processing functionality of the codec, such as transform, quantization, motion search/compensation, and the loop filter. It follows the general concept of most of today's video codecs: a macroblock-based coder that utilizes inter picture prediction with motion compensation, and transform coding of the residual signal. The output of the VCL consists of slices. A slice is a bit string that contains the macroblock data of an integer number of macroblocks, and the information of the slice header (containing the spatial address of the first macroblock in the slice, the initial quantization parameter, and similar). Macroblocks in slices are ordered in scan order unless a different macroblock allocation is specified, using the so-called Flexible Macroblock Ordering syntax. In-picture prediction is used only within a slice.

The NAL encapsulates the slice output of the VCL into Network Abstraction Layer Units (NALUs), which are suitable for transmission over packet networks or use in packet-oriented multiplex environments. JVT's Annex B defines an encapsulation process to transmit such NALUs over byte-stream oriented networks.

The optional reference picture selection mode of H.263 and the NEWPRED coding tool of MPEG-4 Part 2 enable selection of the reference frame for motion compensation for each picture segment, e.g., for each slice in H.263. Furthermore, the optional Enhanced Reference Picture Selection mode of H.263 and the JVT coding standard enable selection of the reference frame for each macroblock separately.

Reference picture selection enables many types of temporal scalability schemes. FIG. 1 shows an example of a temporal scalability scheme, which is herein referred to as recursive temporal scalability. The example scheme can be decoded with three constant frame rates. FIG. 2 depicts a scheme referred to as Video Redundancy Coding, where a sequence of pictures is divided into two or more independently coded threads in an interleaved manner. The arrows in these and all the subsequent figures indicate the direction of motion compensation and the values under the frames correspond to the relative capturing and displaying times of the frames.

Parameter Set Concept

One very fundamental design concept of the JVT codec is to generate self-contained packets, to make mechanisms such as header duplication unnecessary. This was achieved by decoupling information that is relevant to more than one slice from the media stream. This higher layer meta information should be sent reliably, asynchronously, and in advance of the RTP packet stream that contains the slice packets. This information can also be sent in-band in such applications that do not have an out-of-band transport channel appropriate for the purpose. The combination of the higher level parameters is called a Parameter Set. The Parameter Set contains information such as picture size, display window, optional coding modes employed, macroblock allocation map, and others.

In order to be able to change picture parameters (such as the picture size) without needing to transmit Parameter Set updates synchronously to the slice packet stream, the encoder and decoder can maintain a list of more than one Parameter Set. Each slice header contains a codeword that indicates the Parameter Set to be used.
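The lookup described above can be illustrated with a minimal sketch. The class names, the `ps_id` key, and the two-field Parameter Set are hypothetical simplifications chosen for illustration; they are not the syntax of the JVT coding standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParameterSet:
    """Hypothetical subset of the higher-level parameters."""
    width: int
    height: int


class ParameterSetStore:
    """Keeps several Parameter Sets, indexed by an identifier."""

    def __init__(self):
        self._sets = {}

    def store(self, ps_id, parameter_set):
        # May arrive out-of-band, well before any slice that refers to it.
        self._sets[ps_id] = parameter_set

    def for_slice(self, slice_header):
        # The slice header carries only the identifier, not the parameters.
        return self._sets[slice_header["ps_id"]]


store = ParameterSetStore()
store.store(0, ParameterSet(width=176, height=144))
store.store(1, ParameterSet(width=352, height=288))

# Two slices select different picture sizes without any in-band update.
first = store.for_slice({"ps_id": 0})
second = store.for_slice({"ps_id": 1})
```

Because each slice names its Parameter Set by identifier, the picture size can change from one slice to the next without a synchronous update in the slice packet stream.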

This mechanism allows the transmission of the Parameter Sets to be decoupled from the packet stream, so that they can be transmitted by external means, e.g. as a side effect of the capability exchange, or through a (reliable or unreliable) control protocol. It is even possible that they are never transmitted but are fixed by an application design specification.

Transmission Order

In conventional video coding standards, the decoding order of pictures is the same as the display order except for B pictures. A block in a conventional B picture can be bi-directionally temporally predicted from two reference pictures, where one reference picture is temporally preceding and the other reference picture is temporally succeeding in display order. Only the latest reference picture in decoding order can succeed the B picture in display order (exception: interlaced coding in H.263 where both field pictures of a temporally subsequent reference frame can precede a B picture in decoding order). A conventional B picture cannot be used as a reference picture for temporal prediction, and therefore a conventional B picture can be disposed of without affecting the decoding of any other pictures.

The JVT coding standard includes the following novel technical features compared to earlier standards:

-   The decoding order of pictures is decoupled from the display order. The picture number indicates decoding order and the picture order count indicates the display order.
-   Reference pictures for a block in a B picture can either be before or after the B picture in display order. Consequently, a B picture stands for a bi-predictive picture instead of a bi-directional picture.
-   Pictures that are not used as reference pictures are marked explicitly. A picture of any type (intra, inter, B, etc.) can either be a reference picture or a non-reference picture. (Thus, a B picture can be used as a reference picture for temporal prediction of other pictures.)
-   A picture can contain slices that are coded with a different coding type. In other words, a coded picture may consist of an intra-coded slice and a B-coded slice, for example.

Decoupling of display order from decoding order can be beneficial from the point of view of compression efficiency and error resiliency.
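The decoupling of the two orders can be shown with a small sketch. The three-picture sequence and its numbers are hypothetical values chosen for illustration; only the roles of the picture number (decoding order) and the picture order count (display order) are taken from the description above.

```python
from collections import namedtuple

# picture_number: position in decoding order; poc: picture order count.
Picture = namedtuple("Picture", ["name", "picture_number", "poc"])

# Hypothetical coded sequence in which the B picture is decoded after
# the P picture that follows it in display order.
decoded = [
    Picture("I0", 0, 0),
    Picture("P1", 1, 4),
    Picture("B2", 2, 2),
]

# Sorting by picture number gives the decoding order.
decoding_order = [p.name for p in sorted(decoded, key=lambda p: p.picture_number)]

# Sorting by picture order count gives the display order.
display_order = [p.name for p in sorted(decoded, key=lambda p: p.poc)]
```

Here the decoding order is I0, P1, B2 while the display order is I0, B2, P1.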

An example of a prediction structure potentially improving compression efficiency is presented in FIG. 3. Boxes indicate pictures, capital letters within boxes indicate coding types, numbers within boxes are picture numbers according to the JVT coding standard, and arrows indicate prediction dependencies. Note that picture B17 is a reference picture for pictures B18. Compression efficiency is potentially improved compared to conventional coding, because the reference pictures for pictures B18 are temporally closer compared to conventional coding with PBBP or PBBBP coded picture patterns. Compression efficiency is potentially improved compared to the conventional PBP coded picture pattern, because some of the reference pictures are bi-directionally predicted.

FIG. 4 presents an example of the intra picture postponement method that can be used to improve error resiliency. Conventionally, an intra picture is coded immediately after a scene cut or as a response to an expired intra picture refresh period, for example. In the intra picture postponement method, an intra picture is not coded immediately after a need to code an intra picture arises, but rather a temporally subsequent picture is selected as an intra picture. Each picture between the coded intra picture and the conventional location of an intra picture is predicted from the next temporally subsequent picture. As FIG. 4 shows, the intra picture postponement method generates two independent inter picture prediction chains, whereas conventional coding algorithms produce a single inter picture chain. It is intuitively clear that the two-chain approach is more robust against erasure errors than the one-chain conventional approach. If one chain suffers from a packet loss, the other chain may still be correctly received. In conventional coding, a packet loss always causes error propagation to the rest of the inter picture prediction chain.

Two types of ordering and timing information have been conventionally associated with digital video: decoding and presentation order. A closer look at the related technology is taken below.

A decoding timestamp (DTS) indicates the time relative to a reference clock that a coded data unit is supposed to be decoded. If DTS is coded and transmitted, it serves two purposes: First, if the decoding order of pictures differs from their output order, DTS indicates the decoding order explicitly. Second, DTS guarantees a certain pre-decoder buffering behavior provided that the reception rate is close to the transmission rate at any moment. In networks where the end-to-end latency varies, the second use of DTS plays little or no role. Instead, received data is decoded as fast as possible provided that there is room in the post-decoder buffer for uncompressed pictures.

Carriage of DTS depends on the communication system and video coding standard in use. In MPEG-2 Systems, DTS can optionally be transmitted as one item in the header of a PES packet. In the JVT coding standard, DTS can optionally be carried as a part of Supplemental Enhancement Information (SEI), and it is used in the operation of the optional Hypothetical Reference Decoder. In ISO Base Media File Format, DTS has its own dedicated box type, the Decoding Time to Sample Box. In many systems, such as RTP-based streaming systems, DTS is not carried at all, because decoding order is assumed to be the same as transmission order and exact decoding time does not play an important role.

H.263 optional Annex U and Annex W.6.12 specify a picture number that is incremented by 1 relative to the previous reference picture in decoding order. In the JVT coding standard, the frame number coding element is specified similarly to the picture number of H.263. The JVT coding standard specifies a particular type of an intra picture, called an instantaneous decoder refresh (IDR) picture. No subsequent picture can refer to pictures that are earlier than the IDR picture in decoding order. An IDR picture is often coded as a response to a scene change. In the JVT coding standard, frame number is reset to 0 at an IDR picture in order to improve error resilience in case of a loss of the IDR picture as is presented in FIGS. 5 a and 5 b. However, it should be noted that the scene information SEI message of the JVT coding standard can also be used for detecting scene changes.

The H.263 picture number can be used to recover the decoding order of reference pictures. Similarly, the JVT frame number can be used to recover the decoding order of frames between an IDR picture (inclusive) and the next IDR picture (exclusive) in decoding order. However, because the complementary reference field pairs (consecutive pictures coded as fields that are of different parity) share the same frame number, their decoding order cannot be reconstructed from the frame numbers.

The H.263 picture number or JVT frame number of a non-reference picture is specified to be equal to the picture or frame number of the previous reference picture in decoding order plus 1. If several non-reference pictures are consecutive in decoding order, they share the same picture or frame number. The picture or frame number of a non-reference picture is also the same as the picture or frame number of the following reference picture in decoding order. The decoding order of consecutive non-reference pictures can be recovered using the Temporal Reference (TR) coding element in H.263 or the Picture Order Count (POC) concept of the JVT coding standard.
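The numbering rule described above can be sketched as follows. The function name and the `"idr"`, `"ref"`, and `"nonref"` labels are hypothetical. The sketch assumes that the sequence starts with an IDR picture and ignores the wrap-around of the frame number at its maximum value.

```python
def assign_frame_numbers(pictures):
    """Number pictures given in decoding order.

    pictures: list of (name, kind), kind being "idr", "ref" or "nonref".
    Assumes the first picture is an IDR picture; wrap-around is ignored.
    """
    numbered = []
    last_ref = None  # frame number of the previous reference picture
    for name, kind in pictures:
        if kind == "idr":
            number = 0  # frame number is reset at an IDR picture
        else:
            number = last_ref + 1  # previous reference picture plus 1
        if kind != "nonref":
            last_ref = number  # only reference pictures advance the count
        numbered.append((name, number))
    return numbered


sequence = [("I", "idr"), ("P", "ref"), ("b", "nonref"),
            ("b", "nonref"), ("P", "ref")]
numbers = assign_frame_numbers(sequence)
```

In this example the two consecutive non-reference pictures and the reference picture that follows them all receive frame number 2, which is why the frame number alone cannot order them.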

A presentation timestamp (PTS) indicates the time relative to a reference clock when a picture is supposed to be displayed. A presentation timestamp is also called a display timestamp, output timestamp, or composition timestamp.

Carriage of PTS depends on the communication system and video coding standard in use. In MPEG-2 Systems, PTS can optionally be transmitted as one item in the header of a PES packet. In the JVT coding standard, PTS can optionally be carried as a part of Supplemental Enhancement Information (SEI), and it is used in the operation of the Hypothetical Reference Decoder. In ISO Base Media File Format, PTS has its own dedicated box type, the Composition Time to Sample Box, where the presentation timestamp is coded relative to the corresponding decoding timestamp. In RTP, the RTP timestamp in the RTP packet header corresponds to PTS.
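The relative coding of the presentation timestamp mentioned above for the ISO Base Media File Format amounts to adding an offset to each decoding timestamp. The following is a minimal sketch of that arithmetic only; the function name, the tick values, and the four-picture example are hypothetical, and no file parsing is attempted.

```python
def presentation_times(decoding_times, composition_offsets):
    """Return PTS values as DTS plus a per-sample offset (same time unit)."""
    return [dts + offset
            for dts, offset in zip(decoding_times, composition_offsets)]


# Hypothetical pictures I, P, B, B in decoding order, in arbitrary ticks.
dts = [0, 1, 2, 3]
offsets = [1, 3, 0, 0]
pts = presentation_times(dts, offsets)
```

With these values the P picture is decoded second but presented last, after both B pictures.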

Conventional video coding standards feature the Temporal Reference (TR) coding element that is similar to PTS in many aspects. In some of the conventional coding standards, such as MPEG-2 video, TR is reset to zero at the beginning of a Group of Pictures (GOP). In the JVT coding standard, there is no concept of time in the video coding layer. The Picture Order Count (POC) is specified for each frame and field and it is used similarly to TR in direct temporal prediction of B slices, for example. POC is reset to 0 at an IDR picture.

Transmission of Multimedia Streams

A multimedia streaming system consists of a streaming server and a number of players, which access the server via a network. The network is typically packet-oriented and provides little or no means to guarantee quality of service. The players fetch either pre-stored or live multimedia content from the server and play it back in real-time while the content is being downloaded. The type of communication can be either point-to-point or multicast. In point-to-point streaming, the server provides a separate connection for each player. In multicast streaming, the server transmits a single data stream to a number of players, and network elements duplicate the stream only if it is necessary.

When a player has established a connection to a server and requested a multimedia stream, the server begins to transmit the desired stream. The player does not start playing the stream back immediately, but rather it typically buffers the incoming data for a few seconds. Herein, this buffering is referred to as initial buffering. Initial buffering helps to maintain pauseless playback, because, in case of occasional increased transmission delays or network throughput drops, the player can decode and play buffered data.

In order to avoid unlimited transmission delay, it is uncommon to favor reliable transport protocols in streaming systems. Instead, the systems prefer unreliable transport protocols, such as UDP, which, on the one hand, have a more stable transmission delay, but, on the other hand, also suffer from data corruption or loss.

RTP and RTCP protocols can be used on top of UDP to control real-time communications. RTP provides means to detect losses of transmission packets, to reassemble the correct order of packets in the receiving end, and to associate a sampling time-stamp with each packet. RTCP conveys information about how large a portion of packets were correctly received, and, therefore, it can be used for flow control purposes.
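Reordering and loss detection of the kind described above can be based on a per-packet sequence number. The following is a minimal sketch under simplifying assumptions: the function name is hypothetical, sequence numbers are plain integers, and the wrap-around of a fixed-width sequence number field is not handled.

```python
def reorder_and_find_losses(received_seq_numbers):
    """Return (packets in transmission order, missing sequence numbers).

    Assumes no sequence number wrap-around within the input.
    """
    # Sorting restores transmission order; the set drops duplicates.
    ordered = sorted(set(received_seq_numbers))
    missing = []
    for previous, current in zip(ordered, ordered[1:]):
        # Any skipped value between neighbours marks a lost packet.
        missing.extend(range(previous + 1, current))
    return ordered, missing


ordered, missing = reorder_and_find_losses([3, 1, 2, 6, 5, 5])
```

A receiver could report the missing numbers in feedback or use them to trigger loss concealment.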

Transmission Errors

There are two main types of transmission errors, namely bit errors and packet errors. Bit errors are typically associated with a circuit-switched channel, such as a radio access network connection in mobile communications, and they are caused by imperfections of physical channels, such as radio interference. Such imperfections may result in bit inversions, bit insertions and bit deletions in transmitted data. Packet errors are typically caused by elements in packet-switched networks. For example, a packet router may become congested; i.e. it may get too many packets as input and cannot output them at the same rate. In this situation, its buffers overflow, and some packets get lost. Packet duplication and packet delivery in a different order than transmitted are also possible but they are typically considered to be less common than packet losses. Packet errors may also be caused by the implementation of the transport protocol stack in use. For example, some protocols use checksums that are calculated in the transmitter and encapsulated with source-coded data. If there is a bit inversion error in the data, the receiver cannot arrive at the same checksum, and it may have to discard the received packet.

Second generation (2G) and third generation (3G) mobile networks, including GPRS, UMTS, and CDMA-2000, provide two basic types of radio link connections, acknowledged and non-acknowledged. An acknowledged connection is such that the integrity of a radio link frame is checked by the recipient (either the Mobile Station, MS, or the Base Station Subsystem, BSS), and, in case of a transmission error, a retransmission request is given to the other end of the radio link. Due to link layer retransmission, the originator has to buffer a radio link frame until a positive acknowledgement for the frame is received. In harsh radio conditions, this buffer may overflow and cause data loss. Nevertheless, it has been shown that it is beneficial to use the acknowledged radio link protocol mode for streaming services. A non-acknowledged connection is such that erroneous radio link frames are typically discarded.
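The checksum-based discarding mentioned above, in which a bit error turns into a packet error, can be illustrated with a short sketch. The packet layout (payload followed by a four-byte CRC-32) and the function names are assumptions made for illustration; they do not describe any particular protocol.

```python
import zlib


def make_packet(payload: bytes) -> bytes:
    """Transmitter side: append a CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive(packet: bytes):
    """Receiver side: return the payload, or None if the check fails."""
    payload = packet[:-4]
    checksum = int.from_bytes(packet[-4:], "big")
    if zlib.crc32(payload) != checksum:
        return None  # checksum mismatch: the whole packet is discarded
    return payload


packet = make_packet(b"coded slice data")

# Invert a single bit in the first byte to imitate a channel bit error.
corrupted = bytes([packet[0] ^ 0x01]) + packet[1:]

intact_result = receive(packet)
corrupted_result = receive(corrupted)
```

A single inverted bit is enough for the receiver to reject the entire packet, so the application sees a packet loss rather than a bit error.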

Packet losses can either be corrected or concealed. Loss correction refers to the capability to restore lost data perfectly as if no losses had ever been introduced. Loss concealment refers to the capability to conceal the effects of transmission losses so that they should not be visible in the reconstructed video sequence.

When a player detects a packet loss, it may request a packet retransmission. Because of the initial buffering, the retransmitted packet may be received before its scheduled playback time. Some commercial Internet streaming systems implement retransmission requests using proprietary protocols. Work is going on in IETF to standardize a selective retransmission request mechanism as a part of RTCP.

A common feature of all of these retransmission request protocols is that they are not suitable for multicasting to a large number of players, as the network traffic may increase drastically. Consequently, multicast streaming applications have to rely on non-interactive packet loss control.

Point-to-point streaming systems may also benefit from non-interactive error control techniques. First, some systems may not contain any interactive error control mechanism or they prefer not to have any feedback from players in order to simplify the system. Second, retransmission of lost packets and other forms of interactive error control typically take a larger portion of the transmitted data rate than non-interactive error control methods. Streaming servers have to ensure that interactive error control methods do not reserve a major portion of the available network throughput. In practice, the servers may have to limit the amount of interactive error control operations. Third, transmission delay may limit the number of interactions between the server and the player, as all interactive error control operations for a specific data sample should preferably be done before the data sample is played back.

Non-interactive packet loss control mechanisms can be categorized into forward error control and loss concealment by post-processing. Forward error control refers to techniques in which a transmitter adds such redundancy to transmitted data that receivers can recover at least part of the transmitted data even if there are transmission losses. Error concealment by post-processing is totally receiver-oriented. These methods try to estimate the correct representation of erroneously received data.
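One simple form of forward error control is a single XOR parity packet sent with a group of media packets. The sketch below is given only as an illustration of transmitter-added redundancy and is not a scheme taken from this description; it assumes equal-length packets and at most one loss per group, and all names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Transmitter side: XOR of all packets in the group."""
    parity = bytes(len(packets[0]))  # all-zero start value
    for packet in packets:
        parity = xor_bytes(parity, packet)
    return parity


def recover(received, parity):
    """Receiver side: rebuild the one packet marked as None."""
    missing = parity
    for packet in received:
        if packet is not None:
            missing = xor_bytes(missing, packet)
    return missing


group = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(group)

# The second packet is lost in transit; the parity packet arrives.
rebuilt = recover([b"AAAA", None, b"CCCC"], parity)
```

The receiver restores the lost packet without any feedback to the transmitter, at the cost of one extra packet per group.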

Most video compression algorithms generate temporally predicted INTER or P pictures. As a result, a data loss in one picture causes visible degradation in the subsequent pictures that are temporally predicted from the corrupted one. Video communication systems can either conceal the loss in displayed images or freeze the latest correct picture on the screen until a frame which is independent from the corrupted frame is received.

In conventional video coding standards, the decoding order is coupled with the output order. In other words, the decoding order of I and P pictures is the same as their output order, and the decoding order of a B picture immediately follows the decoding order of the latter reference picture of the B picture in output order. Consequently, it is possible to recover the decoding order based on known output order. The output order is typically conveyed in the elementary video bitstream in the Temporal Reference (TR) field and also in the system multiplex layer, such as in the RTP header. Thus, in conventional video coding standards, the presented problem did not exist.

One solution that is evident to an expert in the field is to use a frame counter similar to the H.263 picture number without a reset to 0 at an IDR picture (as done in the JVT coding standard). However, some problems may occur when such solutions are used. FIG. 5 a presents a situation in which a continuous numbering scheme is used. If, for example, the IDR picture I37 is lost (cannot be received/decoded), the decoder continues to decode the succeeding pictures, but it uses a wrong reference picture. This causes error propagation to succeeding frames until the next frame, which is independent from the corrupted frame, is received and decoded correctly. In the example of FIG. 5 b the frame number is reset to 0 at an IDR picture. Now, in a situation in which IDR picture I0 is lost, the decoder notices that there is a big gap in picture numbering after the latest correctly decoded picture P36. The decoder can then assume that an error has occurred and can freeze the display to the picture P36 until the next frame which is independent from the corrupted frame is received and decoded.

Sub-Sequences

The JVT coding standard also includes a sub-sequence concept, which can enhance temporal scalability compared to the use of non-reference pictures, so that inter-predicted chains of pictures can be disposed of as a whole without affecting the decodability of the rest of the coded stream.

A sub-sequence is a set of coded pictures within a sub-sequence layer. A picture shall reside in one sub-sequence layer and in one sub-sequence only. A sub-sequence shall not depend on any other sub-sequence in the same or in a higher sub-sequence layer. A sub-sequence in layer 0 can be decoded independently of any other sub-sequences and previous long-term reference pictures. FIG. 6 a discloses an example of a picture stream containing sub-sequences at layer 1.

A sub-sequence layer contains a subset of the coded pictures in a sequence. Sub-sequence layers are numbered with non-negative integers. A layer having a larger layer number is a higher layer than a layer having a smaller layer number. The layers are ordered hierarchically based on their dependency on each other so that a layer does not depend on any higher layer and may depend on lower layers. In other words, layer 0 is independently decodable, pictures in layer 1 may be predicted from layer 0, pictures in layer 2 may be predicted from layers 0 and 1, etc. The subjective quality is expected to increase along with the number of decoded layers.
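Because a layer never depends on a higher layer, a stream can be thinned by discarding every picture above a chosen layer. The following is a minimal sketch of that selection; the function name, the picture names, and the layer assignments are hypothetical.

```python
def keep_layers(pictures, max_layer):
    """Keep pictures whose sub-sequence layer does not exceed max_layer.

    pictures: list of (name, layer) in decoding order.
    """
    return [name for name, layer in pictures if layer <= max_layer]


# Hypothetical stream with pictures in layers 0, 1 and 2.
stream = [("I0", 0), ("P1", 0), ("B2", 1), ("P3", 0), ("b4", 2)]

base_only = keep_layers(stream, 0)
two_layers = keep_layers(stream, 1)
```

Each result remains decodable under the dependency rule above, with the subjective quality expected to be higher when more layers are kept.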

The sub-sequence concept is included in the JVT coding standard as follows: The required_frame_num_update_behaviour_flag equal to 1 in the sequence parameter set signals that the coded sequence may not contain all sub-sequences. The usage of the required_frame_num_update_behaviour_flag removes the requirement for the frame number increment of 1 for each reference frame. Instead, gaps in frame numbers are marked specifically in the decoded picture buffer. If a “missing” frame number is referred to in inter prediction, a loss of a picture is inferred. Otherwise, frames corresponding to “missing” frame numbers are handled as if they were normal frames inserted to the decoded picture buffer with the sliding window buffering mode. All the pictures in a disposed sub-sequence are consequently assigned a “missing” frame number in the decoded picture buffer, but they are never used in inter prediction for other sub-sequences.

The JVT coding standard also includes optional sub-sequence related SEI messages. The sub-sequence information SEI message is associated with the next slice in decoding order. It signals the sub-sequence layer and sub-sequence identifier (sub_seq_id) of the sub-sequence to which the slice belongs.

Each IDR picture contains an identifier (idr_pic_id). If two IDR pictures are consecutive in decoding order, without any intervening picture, the value of idr_pic_id shall change from the first IDR picture to the other one. If the current picture resides in a sub-sequence whose first picture in decoding order is an IDR picture, the value of sub_seq_id shall be the same as the value of idr_pic_id of the IDR picture.

The solution in JVT-D093 works correctly only if no data resides in sub-sequence layers 1 or above. If transmission order differs from decoding order and coded pictures resided in sub-sequence layer 1, their decoding order relative to pictures in sub-sequence layer 0 could not be concluded based on sub-sequence identifiers and frame numbers. For example, consider the following coding scheme presented in FIG. 6 b where output order runs from left to right, boxes indicate pictures, capital letters within boxes indicate coding types, numbers within boxes are frame numbers according to the JVT coding standard, underlined characters indicate non-reference pictures, and arrows indicate prediction dependencies. If pictures are transmitted in order I0, P1, P3, I0, P1, B2, B4, P5, it cannot be concluded to which independent GOP picture B2 belongs.

It could be argued that in the previous example the correct independent GOP for picture B2 could be concluded based on its output timestamp. However, the decoding order of pictures cannot be recovered based on output timestamps and picture numbers, because decoding order and output order are decoupled. Consider the following example (FIG. 6 c) where output order runs from left to right, boxes indicate pictures, capital letters within boxes indicate coding types, numbers within boxes are frame numbers according to the JVT coding standard, and arrows indicate prediction dependencies. If pictures are transmitted out of decoding order, it cannot be reliably detected whether picture P4 should be decoded after P3 of the first or second independent GOP in output order.

Buffering

Streaming clients typically have a receiver buffer that is capable of storing a relatively large amount of data. Initially, when a streaming session is established, a client does not start playing the stream back immediately, but rather it typically buffers the incoming data for a few seconds. This buffering helps to maintain continuous playback, because, in case of occasional increased transmission delays or network throughput drops, the client can decode and play buffered data. Otherwise, without initial buffering, the client has to freeze the display, stop decoding, and wait for incoming data. The buffering is also necessary for either automatic or selective retransmission at any protocol level. If any part of a picture is lost, a retransmission mechanism may be used to resend the lost data. If the retransmitted data is received before its scheduled decoding or playback time, the loss is perfectly recovered.

Coded pictures can be ranked according to their importance in the subjective quality of the decoded sequence. For example, non-reference pictures, such as conventional B pictures, are subjectively least important, because their absence does not affect decoding of any other pictures. Subjective ranking can also be made on data partition or slice group basis. Coded slices and data partitions that are subjectively the most important can be sent earlier than their decoding order indicates, whereas coded slices and data partitions that are subjectively the least important can be sent later than their natural coding order indicates. Consequently, any retransmitted parts of the most important slices and data partitions are more likely to be received before their scheduled decoding or playback time compared to the least important slices and data partitions.

Pre-Decoder Buffering

Pre-decoder buffering refers to buffering of coded data before it is decoded. Initial buffering refers to pre-decoder buffering at the beginning of a streaming session. Initial buffering is conventionally done for two reasons explained below.

In conversational packet-switched multimedia systems, e.g., in IP-based video conferencing systems, different types of media are normally carried in separate packets. Moreover, packets are typically carried on top of a best-effort network that cannot guarantee a constant transmission delay, but rather the delay may vary from packet to packet. Consequently, packets having the same presentation (playback) time-stamp may not be received at the same time, and the reception interval of two packets may not be the same as their presentation interval (in terms of time). Thus, in order to maintain playback synchronization between different media types and to maintain the correct playback rate, a multimedia terminal typically buffers received data for a short period (e.g. less than half a second) in order to smooth out delay variation. Herein, this type of a buffer component is referred to as a delay jitter buffer. Buffering can take place before and/or after media data decoding.
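The smoothing effect of a delay jitter buffer can be sketched by scheduling each packet at its time-stamp plus a fixed buffering delay. The function name, the millisecond values, and the assumption that sender and receiver share a common clock are all hypothetical simplifications.

```python
def playout_schedule(packets, buffering_delay_ms):
    """Schedule playout at time-stamp plus a fixed buffering delay.

    packets: list of (timestamp_ms, arrival_ms) on a common clock.
    Returns a list of (playout_ms, arrived_in_time).
    """
    schedule = []
    for timestamp_ms, arrival_ms in packets:
        playout_ms = timestamp_ms + buffering_delay_ms
        # A packet is usable only if it arrives by its playout time.
        schedule.append((playout_ms, arrival_ms <= playout_ms))
    return schedule


# Hypothetical packets 40 ms apart with network delays of 60, 90 and 270 ms.
packets = [(0, 60), (40, 130), (80, 350)]
schedule = playout_schedule(packets, buffering_delay_ms=200)
```

The first two packets are played at their correct 40 ms spacing despite unequal network delays, while the third arrives after its playout time and would count as lost. A longer buffering delay would absorb it at the cost of added end-to-end delay.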

Delay jitter buffering is also applied in streaming systems. Due to the fact that streaming is a non-conversational application, the delay jitter buffer required may be considerably larger than in conversational applications. When a streaming player has established a connection to a server and requested a multimedia stream to be downloaded, the server begins to transmit the desired stream. The player does not start playing the stream back immediately, but rather it typically buffers the incoming data for a certain period, typically a few seconds. Herein, this buffering is referred to as initial buffering. Initial buffering provides the ability to smooth out transmission delay variations in a manner similar to that provided by delay jitter buffering in conversational applications. In addition, it may enable the use of link, transport, and/or application layer retransmissions of lost protocol data units (PDUs). The player can decode and play buffered data while retransmitted PDUs may be received in time to be decoded and played back at the scheduled moment.

Initial buffering in streaming clients provides yet another advantage that cannot be achieved in conversational systems: it allows the data rate of the media transmitted from the server to vary. In other words, media packets can be temporarily transmitted faster or slower than their playback rate as long as the receiver buffer does not overflow or underflow. The fluctuation in the data rate may originate from two sources.

First, the compression efficiency achievable in some media types, such as video, depends on the contents of the source data. Consequently, if a stable quality is desired, the bit-rate of the resulting compressed bit-stream varies. Typically, a stable audio-visual quality is subjectively more pleasing than a varying quality. Thus, initial buffering enables a more pleasing audio-visual quality to be achieved compared with a system without initial buffering, such as a video conferencing system.

Second, it is commonly known that packet losses in fixed IP networks occur in bursts. In order to avoid bursty errors and high peak bit- and packet-rates, well-designed streaming servers schedule the transmission of packets carefully. Packets may not be sent precisely at the rate they are played back at the receiving end, but rather the servers may try to achieve a steady interval between transmitted packets. A server may also adjust the rate of packet transmission in accordance with prevailing network conditions, reducing the packet transmission rate when the network becomes congested and increasing it if network conditions allow, for example.

Hypothetical Reference Decoder (HRD)/Video Buffering Verifier (VBV)

Many video coding standards include a HRD/VBV specification as an integral part of the standard. The HRD/VBV specification is a hypothetical decoder model that contains an input (pre-decoder) buffer. The coded data flows into the input buffer typically at a constant bit rate. Coded pictures are removed from the input buffer at their decoding timestamps, which may be the same as their output timestamps. The input buffer is of a certain size depending on the profile and level in use. The HRD/VBV model is used to specify interoperability points from the processing and memory requirements point of view. Encoders shall guarantee that a generated bitstream conforms to the HRD/VBV specification according to the HRD/VBV parameter values of a certain profile and level. Decoders claiming support for a certain profile and level shall be able to decode a bitstream that conforms to the HRD/VBV model.

The HRD comprises a coded picture buffer for storing the coded data stream and a decoded picture buffer for storing decoded reference pictures and for reordering decoded pictures in display order. The HRD moves data between the buffers in the same way as the decoder of a decoding device does. However, the HRD need not decode the coded pictures entirely nor output the decoded pictures, but the HRD only checks that the decoding of the picture stream can be performed under the constraints given in the coding standard. When the HRD is operating, it receives a coded data stream and stores it to the coded picture buffer. In addition, the HRD removes coded pictures from the coded picture buffer and stores at least some of the corresponding hypothetically decoded pictures into the decoded picture buffer. The HRD is aware of the input rate according to which the coded data flows into the coded picture buffer, the removal rate of the pictures from the coded picture buffer, and the output rate of the pictures from the decoded picture buffer. The HRD checks for coded or decoded picture buffer overflows, and it indicates if the decoding is not possible with the current settings. Then the HRD informs the encoder about the buffering violation, wherein the encoder can change the encoding parameters by, for example, reducing the number of reference frames, to avoid the buffering violation. Alternatively or additionally, the encoder starts to encode the pictures with the new parameters and sends the encoded pictures to the HRD, which again performs the decoding of the pictures and the necessary checks. As yet another alternative, the encoder may discard the latest encoded frame and encode later frames so that no buffering violation happens.

Two types of decoder conformance have been specified in the JVT coding standard: output order conformance (VCL conformance) and output time conformance (VCL-NAL conformance). These types of conformance have been specified using the HRD specification. The output order conformance refers to the ability of the decoder to recover the output order of pictures correctly. The HRD specification includes a “bumping decoder” model that outputs the earliest uncompressed picture in output order when a new storage space for a picture is needed. The output time conformance refers to the ability of the decoder to output pictures at the same pace as the HRD model does. The output timestamp of a picture must always be equal to or smaller than the time when it would be removed from the “bumping decoder”.

Interleaving

Frame interleaving is a commonly used technique in audio streaming. In the frame interleaving technique, one RTP packet contains audio frames that are not consecutive in decoding or output order. If one packet in the audio packet stream is lost, the correctly received packets contain neighbouring audio frames which can be used for concealing the lost audio packet (by some sort of interpolation). Many audio coding RTP payload and MIME type specifications contain the possibility to signal the maximum amount of interleaving in one packet in terms of audio frames.

In some prior art encoding/decoding methods the size of the needed buffer is indicated as a count of transmission units.

SUMMARY OF THE INVENTION

The maximum size of the predecoding buffer of a decoder can be indicated to the decoder in bytes. If the byte-based scheme is used and the reordering process is not defined for the decoder, the buffering model has to be explicitly defined, because the encoder and decoder may use different buffering schemes. If a certain size in bytes is defined for the buffer and the decoder uses a buffering scheme in which transmission units are stored to the buffer until the buffer is full and only after that the oldest data is removed from the buffer and decoded, the buffering may last longer than necessary before the decoding is started.

Another possibility to indicate the maximum size of the predecoding buffer is to use transmission units, wherein the size of the buffer is indicated as the maximum number of transmission units to be buffered. However, the maximum size of the transmission unit is not defined and the size of the transmission unit may vary. If the maximum size were defined and if the size is too small for a certain data unit, the data unit has to be divided into more than one transmission unit, which increases encoding and transmission overhead, i.e. decreases the compression efficiency and/or increases system complexity. The maximum size of the transmission unit should therefore be large enough, wherein the total size of the buffer may be unnecessarily large.

In the present invention the buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size. In addition to the total size it may be necessary to take into account a network transmission jitter.

According to another aspect of the present invention the number of transmission units used in the calculation of the total size is a fractional number of the necessary buffer size in terms of the number of transmission units.

According to still another aspect of the present invention the number of transmission units used in the calculation of the total size is a fractional number of the necessary buffer size in terms of the number of transmission units, wherein the fractional number is of the form 1/N, N being an integer number.

According to yet another aspect of the present invention the number of transmission units used in the calculation of the total size is the same as the necessary buffer size in terms of the number of transmission units.

In an embodiment of the present invention the number of transmission units used in the calculation of the total size is expressed in the buffering order of the transmission units. The buffering order relates to the order in which the transmission units are buffered in the decoder for decoding, i.e. the buffering order in the predecoder buffer.

The invention enables defining the size of the receiving buffer to the decoder.

In the following, an independent GOP consists of pictures from an IDR picture (inclusive) to the next IDR picture (exclusive) in decoding order.

In the present invention a parameter signalling the maximum amount of required buffering is proposed. Several units for such a parameter were considered: duration, bytes, coded pictures, frames, VCL NAL units, all types of NAL units, and RTP packets or payloads. Specifying the amount of disorder as a duration makes the required amount of buffering in bytes depend on both the transmission bit rate and the specified duration. As the transmission bit rate is not generally known, the duration-based approach is not used. Specifying the amount of disorder in number of bytes would require the transmitter to check the transmitted stream carefully so that the signalled limit would not be exceeded. This approach requires a lot of processing power from all servers. It would also require specifying a buffering verifier for servers. Specifying the amount of disorder in coded pictures or frames is too coarse a unit, since a simple slice interleaving method for decoders that do not support arbitrary slice ordering would require a sub-picture resolution to achieve minimal latency of buffering for recovery of the decoding order. Specifying the amount of disorder in number of RTP packets was not considered appropriate, because different types of aggregate packets may exist depending on the prevailing network conditions. Thus, one RTP packet may contain a varying amount of data. Different SEI messages may be transmitted depending on the prevailing network conditions. For example, in relatively bad conditions, it is beneficial to transmit SEI messages that are targeted for error resilience, such as the scene information SEI message. Thus, the amount of disorder in number of all types of NAL units depends on prevailing network conditions, i.e., the amount of SEI and parameter set NAL units being transmitted out of order. Therefore, “all types of NAL units” was not seen as a good unit either. Consequently, specifying the amount of disorder in number of VCL NAL units was considered the best alternative. VCL NAL units are defined in the JVT coding standard to be coded slices, coded data partitions, or end-of-sequence markers.

The proposed parameter is the following: num-reorder-VCL-NAL-units. It specifies the maximum amount of VCL NAL units that precede any VCL NAL unit in the packet stream in NAL unit delivery order and follow the VCL NAL unit in RTP sequence number order or in the composition order of the aggregation packet containing the VCL NAL unit.

The proposed parameter can be conveyed as an optional parameter in the MIME type announcement or as optional SDP fields. The proposed parameter can indicate decoder capability or stream characteristics or both, depending on the protocol and the phase of the session setup procedure.

The buffer size of a buffer built according to the num-reorder-VCL-NAL-units parameter cannot be specified accurately in bytes. In order to allow designing of receivers where the buffering memory requirements are known accurately, specification of decoding time conformance is proposed. Decoding time conformance is specified using a hypothetical buffering model that does not assume a constant input bit rate, but rather requires that streaming servers shall include the model to guarantee that the transmitted packet stream conforms to the model. The specified hypothetical buffer model smoothes out a possibly bursty packet rate and reorders NAL units from transmission order to the decoding order so that the resulting bitstream can be input to the hypothetical decoder at a constant bit rate.

In the following description the invention is described by using an encoder-decoder based system, but it is obvious that the invention can also be implemented in systems in which the video signals are stored. The stored video signals can be uncoded signals stored before encoding, encoded signals stored after encoding, or decoded signals stored after the encoding and decoding process. For example, an encoder produces bitstreams in decoding order. A file system receives audio and/or video bitstreams which are encapsulated e.g. in decoding order and stored as a file. In addition, the encoder and the file system can produce metadata which indicates the subjective importance of the pictures and NAL units and contains information on sub-sequences, inter alia. The file can be stored into a database from which a streaming server can read the NAL units and encapsulate them into RTP packets. According to the optional metadata and the data connection in use, the streaming server can make the transmission order of the packets different from the decoding order, remove sub-sequences, decide what SEI messages will be transmitted, if any, etc. At the receiving end the RTP packets are received and buffered. Typically, the NAL units are first reordered into the correct order and after that the NAL units are delivered to the decoder.

Furthermore, in the following description the invention is described by using an encoder-decoder based system, but it is obvious that the invention can also be implemented in systems where the encoder outputs and transmits coded data to another component, such as a streaming server, in decoding order, where the other component reorders the coded data from the decoding order to another order and forwards the coded data in its reordered form to the decoder.

The method according to the present invention is primarily characterized in that the buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size. The system according to the present invention is primarily characterized in that the system further comprises a definer for defining the buffer size so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size. The encoder according to the present invention is primarily characterized in that the encoder further comprises a definer for defining the buffer size so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size. The decoder according to the present invention is primarily characterized in that the decoder further comprises a processor for allocating memory for the pre-decoding buffer according to a received parameter indicative of the buffer size, and the buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size. The transmitting device according to the present invention is primarily characterized in that the transmitting device further comprises a definer for defining the buffer size so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size. The receiving device according to the present invention is primarily characterized in that the decoder further comprises a processor for allocating memory for the pre-decoding buffer according to a received parameter indicative of the buffer size, and the buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size. The software program according to the present invention is primarily characterized in that the buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size. The storage medium according to the present invention is primarily characterized in that the buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size. The electronic device according to the present invention is primarily characterized in that the electronic device further comprises a definer for defining the buffer size so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size.

Substitutive signalling to any decoding order information in the video bitstream is presented in the following according to an advantageous embodiment of the present invention. A Decoding Order Number (DON) indicates the decoding order of NAL units, in other words the delivery order of the NAL units to the decoder. Hereinafter, DON is assumed to be a 16-bit unsigned integer without loss of generality. Let DON of one NAL unit be D1 and DON of another NAL unit be D2. If D1<D2 and D2−D1<32768, or if D1>D2 and D1−D2>=32768, then the NAL unit having DON equal to D1 precedes the NAL unit having DON equal to D2 in NAL unit delivery order. If D1<D2 and D2−D1>=32768, or if D1>D2 and D1−D2<32768, then the NAL unit having DON equal to D2 precedes the NAL unit having DON equal to D1 in NAL unit delivery order. NAL units associated with different primary coded pictures do not have the same value of DON. NAL units associated with the same primary coded picture may have the same value of DON. If all NAL units of a primary coded picture have the same value of DON, NAL units of a redundant coded picture associated with the primary coded picture should have the same value of DON as the NAL units of the primary coded picture. The NAL unit delivery order of NAL units having the same value of DON is preferably the following:

1. Picture delimiter NAL unit, if any
2. Sequence parameter set NAL units, if any
3. Picture parameter set NAL units, if any
4. SEI NAL units, if any
5. Coded slice and slice data partition NAL units of the primary coded picture, if any
6. Coded slice and slice data partition NAL units of the redundant coded pictures, if any
7. Filler data NAL units, if any
8. End of sequence NAL unit, if any
9. End of stream NAL unit, if any.
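The wrap-around comparison of two DON values described above can be illustrated with a minimal Python sketch. The function name don_precedes is hypothetical and not part of the specification; the sketch only assumes that DON is a 16-bit unsigned integer, as stated in the text, and that the first rule is intended to decide when the unit with DON equal to D1 comes first.

```python
HALF_RANGE = 32768  # half of the 16-bit DON range (65536 values)


def don_precedes(d1: int, d2: int) -> bool:
    """Return True if the NAL unit with DON d1 precedes the one with DON d2.

    Hypothetical helper: follows the two rules given for D1 and D2.
    Equal DON values return False, because their relative order is then
    given by the NAL unit type list rather than by DON.
    """
    if d1 < d2:
        # No wrap-around between d1 and d2 when they are close together.
        return d2 - d1 < HALF_RANGE
    if d1 > d2:
        # d1 is numerically larger but d2 has wrapped past 65535.
        return d1 - d2 >= HALF_RANGE
    return False


print(don_precedes(10, 20))      # d1 is earlier, no wrap-around
print(don_precedes(65530, 5))    # d2 wrapped around, d1 is still earlier
```

Comparing the raw integers directly would misorder NAL units whenever the counter wraps from 65535 to 0, which is why the half-range test is used.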

The present invention improves the buffering efficiency of the coding systems. By using the present invention it is possible to inform the decoding device how much pre-decoding buffering is required. Therefore, there is no need to allocate more memory for the pre-decoding buffer than necessary in the decoding device. Also, pre-decoding buffer overflow can be avoided.

DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example of a recursive temporal scalability scheme,

FIG. 2 depicts a scheme referred to as Video Redundancy Coding, where a sequence of pictures is divided into two or more independently coded threads in an interleaved manner,

FIG. 3 presents an example of a prediction structure potentially improving compression efficiency,

FIG. 4 presents an example of the intra picture postponement method that can be used to improve error resiliency,

FIGS. 5 a and 5 b disclose different prior art numbering schemes for pictures of encoded video stream,

FIG. 6 a discloses an example of a picture stream containing sub-sequences at layer 1,

FIG. 6 b discloses an example of a picture stream containing two groups of pictures having sub-sequences at layer 1,

FIG. 6 c discloses an example of a picture stream of different group of pictures,

FIG. 7 discloses another example of a picture stream containing sub-sequences at layer 1,

FIG. 8 depicts an advantageous embodiment of the system according to the present invention,

FIG. 9 depicts an advantageous embodiment of the encoder according to the present invention,

FIG. 10 depicts an advantageous embodiment of the decoder according to the present invention,

FIG. 11 a discloses an example of the NAL packet format which can be used with the present invention,

FIG. 11 b discloses another example of the NAL packet format which can be used with the present invention, and

FIG. 12 depicts an example of buffering of transmission units in a predecoder buffer.

DETAILED DESCRIPTION OF THE INVENTION

The general concept behind de-packetization rules is to reorder transmission units such as NAL units from transmission order to the NAL unit decoding order.

The receiver includes a receiver buffer (or a predecoder buffer), which is used to reorder packets from transmission order to the NAL unit decoding order. In an example embodiment of the present invention the receiver buffer size is set, in terms of number of bytes, equal to or greater than the value of a deint-buf-size parameter, for example to 1.2 times the value of the deint-buf-size MIME parameter. The receiver may also take buffering for transmission delay jitter into account and either reserve a separate buffer for transmission delay jitter buffering or combine the buffer for transmission delay jitter with the receiver buffer (and hence reserve some additional space for delay jitter buffering in the receiver buffer).

The receiver stores incoming NAL units in reception order into the receiver buffer as follows. NAL units of aggregation packets are stored into the receiver buffer individually. The value of DON is calculated and stored for all NAL units.

Hereinafter, let N be the value of the optional num-reorder-VCL-NAL-units parameter (interleaving-depth parameter) which specifies the maximum amount of VCL NAL units that precede any VCL NAL unit in the packet stream in NAL unit transmission order and follow the VCL NAL unit in decoding order. If the parameter is not present, a value of 0 could be implied.
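As an illustration of this definition, the following Python sketch computes N for a given stream. It is a minimal example under stated assumptions, not a normative procedure: the function name interleaving_depth is hypothetical, and the input is assumed to be a list in which entry i is the decoding-order position of the i-th VCL NAL unit in transmission order (positions that do not wrap around).

```python
from typing import Sequence


def interleaving_depth(decoding_positions: Sequence[int]) -> int:
    """Return the largest number of VCL NAL units that precede some unit in
    transmission order while following it in decoding order.

    Hypothetical helper; decoding_positions[i] is the decoding-order
    position of the i-th unit in transmission order.
    """
    depth = 0
    for i, pos in enumerate(decoding_positions):
        # Units sent earlier (index < i) but decoded later (position > pos).
        sent_earlier_decoded_later = sum(
            1 for earlier in decoding_positions[:i] if earlier > pos
        )
        depth = max(depth, sent_earlier_decoded_later)
    return depth


# Units 2 and 3 are transmitted before units 0 and 1.
print(interleaving_depth([2, 3, 0, 1]))
```

A stream transmitted in decoding order gives a depth of 0, which matches the value implied when the parameter is absent.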

When the video stream transfer session is initialized, the receiver 8 allocates memory for the receiving buffer 9.1 for storing at least N pieces of VCL NAL units. The receiver then starts to receive the video stream and stores the received VCL NAL units into the receiving buffer. The initial buffering lasts

- until at least N pieces of VCL NAL units are stored into the receiving buffer 9.1, or
- if max-don-diff MIME parameter is present, until the value of a function don_diff(m,n) is greater than the value of max-don-diff, in which n corresponds to the NAL unit having the greatest value of AbsDON among the received NAL units and m corresponds to the NAL unit having the smallest value of AbsDON among the received NAL units, or
- until initial buffering has lasted for the duration equal to or greater than the value of the optional init-buf-time MIME parameter.

The function don_diff(m,n) is specified as follows:
If DON(m)==DON(n), don_diff(m,n)=0
If (DON(m)<DON(n) and DON(n)−DON(m)<32768), don_diff(m,n)=DON(n)−DON(m)
If (DON(m)>DON(n) and DON(m)−DON(n)>=32768), don_diff(m,n)=65536−DON(m)+DON(n)
If (DON(m)<DON(n) and DON(n)−DON(m)>=32768), don_diff(m,n)=−(DON(m)+65536−DON(n))
If (DON(m)>DON(n) and DON(m)−DON(n)<32768), don_diff(m,n)=−(DON(m)−DON(n))
where DON(i) is the decoding order number of the NAL unit having index i in the transmission order.
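The five cases of don_diff can be written out directly, as in the Python sketch below. For brevity the sketch takes the two DON values themselves rather than transmission order indices, which is an assumption made for illustration only.

```python
def don_diff(don_m: int, don_n: int) -> int:
    """Signed decoding-order distance from NAL unit m to NAL unit n.

    Takes the 16-bit DON values of the two units (not their indices).
    A positive result means n follows m in decoding order.
    """
    if don_m == don_n:
        return 0
    if don_m < don_n and don_n - don_m < 32768:
        return don_n - don_m                    # n is later, no wrap-around
    if don_m > don_n and don_m - don_n >= 32768:
        return 65536 - don_m + don_n            # n is later, DON wrapped
    if don_m < don_n and don_n - don_m >= 32768:
        return -(don_m + 65536 - don_n)         # n is earlier, DON wrapped
    return -(don_m - don_n)                     # n is earlier, no wrap-around


print(don_diff(65530, 5))   # n follows m across the wrap-around
print(don_diff(5, 65530))   # n precedes m across the wrap-around
```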

A positive value of don_diff(m,n) indicates that the NAL unit having transmission order index n follows, in decoding order, the NAL unit having transmission order index m.

AbsDON denotes such decoding order number of the NAL unit that does not wrap around to 0 after 65535. In other words, AbsDON is calculated as follows:

Let m and n be consecutive NAL units in transmission order. For the very first NAL unit in transmission order (whose index is 0), AbsDON(0)=DON(0). For other NAL units, AbsDON is calculated as follows:
If DON(m)==DON(n), AbsDON(n)=AbsDON(m)
If (DON(m)<DON(n) and DON(n)−DON(m)<32768), AbsDON(n)=AbsDON(m)+DON(n)−DON(m)
If (DON(m)>DON(n) and DON(m)−DON(n)>=32768), AbsDON(n)=AbsDON(m)+65536−DON(m)+DON(n)
If (DON(m)<DON(n) and DON(n)−DON(m)>=32768), AbsDON(n)=AbsDON(m)−(DON(m)+65536−DON(n))
If (DON(m)>DON(n) and DON(m)−DON(n)<32768), AbsDON(n)=AbsDON(m)−(DON(m)−DON(n))
where DON(i) is the decoding order number of the NAL unit having index i in the transmission order.
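In every case above, AbsDON(n) equals AbsDON(m) plus the signed wrap-aware difference between DON(m) and DON(n). The following Python sketch uses that observation to unwrap a whole sequence. The names signed_don_step and abs_dons are hypothetical, and the input is assumed to be the list of DON values in transmission order.

```python
from typing import List, Sequence


def signed_don_step(don_m: int, don_n: int) -> int:
    """Signed, wrap-aware difference from DON(m) to DON(n) (hypothetical helper)."""
    if don_m == don_n:
        return 0
    if don_m < don_n:
        diff = don_n - don_m
        return diff if diff < 32768 else -(don_m + 65536 - don_n)
    diff = don_m - don_n
    return 65536 - don_m + don_n if diff >= 32768 else -diff


def abs_dons(dons: Sequence[int]) -> List[int]:
    """Return AbsDON for each NAL unit, given DON values in transmission order."""
    if not dons:
        return []
    result = [dons[0]]  # AbsDON(0) = DON(0)
    for don_m, don_n in zip(dons, dons[1:]):
        # m and n are consecutive units in transmission order.
        result.append(result[-1] + signed_don_step(don_m, don_n))
    return result


print(abs_dons([65534, 65535, 0, 1]))  # values keep increasing past 65535
```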

There are usually two buffering states in the receiver: initial buffering and buffering while playing. Initial buffering occurs when the RTP session is initialized. After initial buffering, decoding and playback is started and the buffering-while-playing mode is used.

When the receiver buffer 9.1 contains at least N VCL NAL units, NAL units are removed from the receiver buffer 9.1 one by one and passed to the decoder 2. The NAL units are not necessarily removed from the receiver buffer 9.1 in the same order in which they were stored, but according to the DON of the NAL units, as described below. The delivery of the packets to the decoder 2 is continued until the buffer contains fewer than N VCL NAL units, i.e. N−1 VCL NAL units.

The NAL units to be removed from the receiver buffer are determined as follows:

- If the receiver buffer contains at least N VCL NAL units, NAL units are removed from the receiver buffer and passed to the decoder in the order specified below until the buffer contains N−1 VCL NAL units.
- If max-don-diff is present, all NAL units m for which don_diff(m,n) is greater than max-don-diff are removed from the receiver buffer and passed to the decoder in the order specified below. Herein, n corresponds to the NAL unit having the greatest value of AbsDON among the received NAL units.
- A variable ts is set to the value of a system timer that was initialized to 0 when the first packet of the NAL unit stream was received. If the receiver buffer contains a NAL unit whose reception time tr fulfills the condition that ts−tr>init-buf-time, NAL units are passed to the decoder (and removed from the receiver buffer) in the order specified below until the receiver buffer contains no NAL unit whose reception time tr fulfills the specified condition.

The order that NAL units are passed to the decoder is specified as follows.

Let PDON be a variable that is initialized to 0 at the beginning of an RTP session. For each NAL unit associated with a value of DON, a DON distance is calculated as follows. If the value of DON of the NAL unit is larger than the value of PDON, the DON distance is equal to DON−PDON. Otherwise, the DON distance is equal to 65535−PDON+DON+1.

NAL units are delivered to the decoder in ascending order of DON distance. If several NAL units share the same value of DON distance, they can be passed to the decoder in any order. When a desired number of NAL units have been passed to the decoder, the value of PDON is set to the value of DON for the last NAL unit passed to the decoder.
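The ordering rule can be sketched in Python as follows. This is an illustrative example only: the names don_distance and delivery_order are hypothetical, and the fallback case is read as 65535 − PDON + DON + 1, i.e. the wrap-around distance from PDON forward to DON, which is an assumption about the intended formula.

```python
from typing import List, Sequence


def don_distance(don: int, pdon: int) -> int:
    """Forward distance from PDON to DON on the 16-bit DON circle."""
    if don > pdon:
        return don - pdon
    # DON has wrapped around past 65535 (or equals PDON).
    return 65535 - pdon + don + 1


def delivery_order(buffered_dons: Sequence[int], pdon: int = 0) -> List[int]:
    """Return the buffered DON values in the order they go to the decoder.

    Units with equal DON distance may be delivered in any order; sorted()
    simply keeps their buffer order.
    """
    return sorted(buffered_dons, key=lambda don: don_distance(don, pdon))


# PDON is close to the wrap-around point, so DON 65534 is delivered first.
print(delivery_order([3, 65534, 1], pdon=65530))
```

After delivery, a receiver following the text would set PDON to the DON of the last NAL unit passed to the decoder before ordering the next batch.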

Additional De-Packetization Guidelines

The following additional de-packetization rules may be used to implement an operational H.264 de-packetizer:

Intelligent RTP receivers (e.g. in gateways) may identify lost coded slice data partitions A (DPAs). If a lost DPA is found, a gateway may decide not to send the corresponding coded slice data partitions B and C, as their information is meaningless for H.264 decoders. In this way a network element can reduce network load by discarding useless packets, without parsing a complex bitstream.

Intelligent RTP receivers (e.g. in gateways) may identify lost Fragmentation Units (FUs). If a lost FU is found, a gateway may decide not to send the following FUs of the same NAL unit, as their information is meaningless for H.264 decoders. In this way a network element can reduce network load by discarding useless packets, without parsing a complex bitstream.

Intelligent receivers having to discard packets or NALUs could first discard all packets/NALUs in which the value of the NRI field of the NAL unit type octet is equal to 0. This may minimize the impact on user experience.
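A minimal Python sketch of the NRI-based discard is given below. It assumes the H.264 NAL unit header layout in which the first octet carries a 1-bit forbidden_zero_bit, the 2-bit NRI field, and a 5-bit NAL unit type. The function names nri and split_by_disposability are hypothetical.

```python
from typing import Iterable, List, Tuple


def nri(nal_unit: bytes) -> int:
    """Extract the 2-bit NRI field from the first octet of a NAL unit."""
    return (nal_unit[0] >> 5) & 0x03


def split_by_disposability(
    nal_units: Iterable[bytes],
) -> Tuple[List[bytes], List[bytes]]:
    """Split NAL units into (keep, discard_first).

    Units with NRI equal to 0 are not used for reference, so they are the
    first candidates to drop when the receiver has to discard data.
    """
    keep: List[bytes] = []
    discard_first: List[bytes] = []
    for unit in nal_units:
        (discard_first if nri(unit) == 0 else keep).append(unit)
    return keep, discard_first


units = [bytes([0x65, 0xAA]), bytes([0x01, 0xBB]), bytes([0x41, 0xCC])]
keep, discard_first = split_by_disposability(units)
print(len(keep), len(discard_first))
```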

In the following a parameter to be used for indicating the maximum buffer size in the decoder will be described. The parameter deint-buf-size is normally not present when a packetization-mode parameter indicative of the packetization mode is not present or the value of the packetization-mode parameter is equal to 0 or 1. This parameter should be present when the value of the packetization-mode parameter is equal to 2.

The value of the deint-buf-size parameter is specified in association with the following hypothetical deinterleaving buffer model. At the beginning, the hypothetical deinterleaving buffer is empty and the maximum buffer occupancy m is set to 0. The following process is used in the model:

i) The next VCL NAL unit in transmission order is inserted to the hypothetical deinterleaving buffer.

ii) Let s be the total size of VCL NAL units in the buffer in terms of bytes.

iii) If s is greater than m, m is set equal to s.

iv) If the number of VCL NAL units in the hypothetical deinterleaving buffer is less than or equal to the value of interleaving-depth, the process is continued from stage vii.

v) The VCL NAL unit earliest in decoding order among the VCL NAL units in the hypothetical deinterleaving buffer is determined from the values of DON for the VCL NAL units according to section 5.5 of RFC XXXX.

vi) The earliest VCL NAL unit is removed from the hypothetical deinterleaving buffer.

vii) If there are no more VCL NAL units in transmission order, the process is terminated.

viii) The process is continued from stage i.
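Stages i to viii can be sketched as a short Python simulation that returns the maximum buffer occupancy m. This is a minimal sketch under stated assumptions: the function name max_deint_occupancy is hypothetical, and each VCL NAL unit is represented as a pair (decoding-order key, size in bytes), where the key is assumed to be a non-wrapping value such as AbsDON so that the earliest unit can be found with a plain minimum.

```python
from typing import List, Sequence, Tuple

VclUnit = Tuple[int, int]  # (decoding-order key, size in bytes)


def max_deint_occupancy(units: Sequence[VclUnit], interleaving_depth: int) -> int:
    """Simulate the hypothetical deinterleaving buffer and return m in bytes.

    units is given in transmission order.
    """
    buffer: List[VclUnit] = []
    m = 0
    for unit in units:                                    # stages vii and viii
        buffer.append(unit)                               # stage i
        s = sum(size for _, size in buffer)               # stage ii
        if s > m:                                         # stage iii
            m = s
        if len(buffer) > interleaving_depth:              # stage iv
            earliest = min(buffer, key=lambda u: u[0])    # stage v
            buffer.remove(earliest)                       # stage vi
    return m


stream = [(0, 100), (2, 300), (1, 200), (3, 50)]
print(max_deint_occupancy(stream, interleaving_depth=1))
```

A sender could run such a simulation over the stream it intends to transmit and signal a deint-buf-size value that is at least the returned m.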

This parameter signals the properties of a NAL unit stream or the capabilities of a receiver implementation. When the parameter is used to signal the properties of a NAL unit stream, the value of the parameter, referred to as v, is such that:

a) the value of m resulting when the NAL unit stream is entirely processed by the hypothetical deinterleaving buffer model is less than or equal to v, or

b) the order of VCL NAL units determined by removing the earliest VCL NAL unit from a deinterleaving buffer as long as there is a buffer overflow is the same as the removal order of VCL NAL units from the hypothetical deinterleaving buffer.

Consequently, it is guaranteed that receivers can reconstruct VCL NAL unit decoding order, when the buffer size for VCL NAL unit decoding order recovery is at least the value of deint-buf-size in terms of bytes.

When the parameter is used to signal the capabilities of a receiver implementation, the receiver is able to correctly reconstruct the VCL NAL unit decoding order of any NAL unit stream that is characterized by the same value of deint-buf-size. When the receiver buffers a number of bytes that is equal to or greater than the value of deint-buf-size, it is able to reconstruct VCL NAL unit decoding order from the transmission order.

The non-VCL NAL units should also be taken into account when determiningthe size of the deinterleaving buffer. When this parameter is present, asufficient size of the deinterleaving buffer for all NAL units is lessthan or equal to 20% larger than the value of the parameter.

If the parameter is not present, then a value of 0 is used fordeint-buf-size. The value of deint-buf-size is an integer in the rangeof, for example, 0 to 4 294 967 295, inclusive.

In the following the invention will be described in more detail with reference to the system of FIG. 8, the encoder 1 and the hypothetical reference decoder (HRD) 5 of FIG. 9 and decoder 2 of FIG. 10. The pictures to be encoded can be, for example, pictures of a video stream from a video source 3, e.g. a camera, a video recorder, etc. The pictures (frames) of the video stream can be divided into smaller portions such as slices. The slices can further be divided into blocks. In the encoder 1 the video stream is encoded to reduce the information to be transmitted via a transmission channel 4, or to a storage medium (not shown). Pictures of the video stream are input to the encoder 1. The encoder has an encoding buffer 1.1 (FIG. 9) for temporarily storing some of the pictures to be encoded. The encoder 1 also includes a memory 1.3 and a processor 1.2 in which the encoding tasks according to the invention can be applied. The memory 1.3 and the processor 1.2 can be common with the transmitting device 6 or the transmitting device 6 can have another processor and/or memory (not shown) for other functions of the transmitting device 6. The encoder 1 performs motion estimation and/or some other tasks to compress the video stream. In motion estimation, similarities between the picture to be encoded (the current picture) and a previous and/or later picture are searched. If similarities are found, the compared picture or part of it can be used as a reference picture for the picture to be encoded. In JVT the display order and the decoding order of the pictures are not necessarily the same, wherein the reference picture has to be stored in a buffer (e.g. in the encoding buffer 1.1) as long as it is used as a reference picture. The encoder 1 also inserts information on display order of the pictures into the transmission stream.

From the encoding process the encoded pictures are moved to an encoded picture buffer 5.2, if necessary. The encoded pictures are transmitted from the encoder 1 to the decoder 2 via the transmission channel 4. In the decoder 2 the encoded pictures are decoded to form uncompressed pictures corresponding as much as possible to the encoded pictures. Each decoded picture is buffered in the DPB 2.1 of the decoder 2 unless it is displayed substantially immediately after the decoding and is not used as a reference picture. In the system according to the present invention both the reference picture buffering and the display picture buffering are combined and they use the same decoded picture buffer 2.1. This eliminates the need for storing the same pictures in two different places, thus reducing the memory requirements of the decoder 2.

The decoder 2 also includes a memory 2.3 and a processor 2.2 in which the decoding tasks according to the invention can be applied. The memory 2.3 and the processor 2.2 can be common with the receiving device 8 or the receiving device 8 can have another processor and/or memory (not shown) for other functions of the receiving device 8.

Encoding

Let us now consider the encoding-decoding process in more detail. Pictures from the video source 3 are input to the encoder 1 and advantageously stored in the encoding buffer 1.1. The encoding process is not necessarily started immediately after the first picture is input to the encoder, but after a certain number of pictures are available in the encoding buffer 1.1. Then the encoder 1 tries to find suitable candidates from the pictures to be used as the reference frames. The encoder 1 then performs the encoding to form encoded pictures. The encoded pictures can be, for example, predicted pictures (P), bi-predictive pictures (B), and/or intra-coded pictures (I). The intra-coded pictures can be decoded without using any other pictures, but other types of pictures need at least one reference picture before they can be decoded. Pictures of any of the above mentioned picture types can be used as a reference picture.

The encoder advantageously attaches two time stamps to the pictures: a decoding time stamp (DTS) and an output time stamp (OTS). The decoder can use the time stamps to determine the correct decoding time and time to output (display) the pictures. However, those time stamps are not necessarily transmitted to the decoder, or the decoder may not use them.

The encoder also forms sub-sequences on one or more layers above the lowest layer 0. The pictures on layer 0 are independently decodable, but the pictures on higher layers may depend on pictures on some lower layer or layers. In the example of FIG. 6a there are two layers: layer 0 and layer 1. The pictures I0, P6 and P12 belong to the layer 0 while other pictures P1-P5, P7-P11 shown on FIG. 6a belong to the layer 1. Advantageously, the encoder forms groups of pictures (GOP) so that each picture of one GOP can be reconstructed by using only the pictures in the same GOP. In other words, one GOP contains at least one independently decodable picture and all the other pictures for which the independently decodable picture is a reference picture. In the example of FIG. 7, there are two groups of pictures. The first group of pictures includes the pictures I0(0), P1(0), P3(0) on layer 0, and pictures B2(0), 2xB3(0), B4(0), 2xB5(0), B6(0), P5(0), P6(0) on layer 1. The second group of pictures includes the pictures I0(1), and P1(1) on layer 0, and pictures 2xB3(1) and B2(1) on layer 1. The pictures on layer 1 of each group of pictures are further arranged as sub-sequences. The first sub-sequence of the first group of pictures contains pictures B3(0), B2(0), B3(0), the second sub-sequence contains pictures B5(0), B4(0), B5(0), and the third sub-sequence contains pictures B6(0), P5(0), P6(0). The sub-sequence of the second group of pictures contains pictures B3(1), B2(1), B3(1). The numbers in brackets indicate the video sequence ID defined for the group of pictures to which the picture belongs.

The video sequence ID is transferred for each picture. It can be conveyed within the video bitstream, such as in the Supplemental Enhancement Information data. The video sequence ID can also be transmitted in the header fields of the transport protocol, such as within the RTP payload header of the JVT coding standard. The video sequence ID according to the presented partitioning to independent GOPs can be stored in the metadata of the video file format, such as in the MPEG-4 AVC file format.

FIGS. 11a and 11b disclose examples of the NAL packet formats which can be used with the present invention. The packet contains a header 11 and a payload part 12. The header 11 contains advantageously an error indicator field 11.1 (F, Forbidden), a priority field 11.2, and a type field 11.3. The error indicator field 11.1, when cleared, indicates a bit error free NAL unit. Advantageously, when the error indicator field is set, the decoder is advised that bit errors may be present in the payload or in the NALU type octet. Decoders that are incapable of handling bit errors can then discard such packets. The priority field 11.2 is used for indicating the importance of the picture encapsulated in the payload part 12 of the packet. In an example implementation, the priority field can have four different values as follows. A value of 00 indicates that the content of the NALU is not used to reconstruct stored pictures (that can be used for future reference). Such NALUs can be discarded without risking the integrity of the reference pictures. Values above 00 indicate that the decoding of the NALU is required to maintain the integrity of the reference pictures. Furthermore, values above 00 indicate the relative transport priority, as determined by the encoder. Intelligent network elements can use this information to protect more important NALUs better than less important NALUs. 11 is the highest transport priority, followed by 10, then by 01 and, finally, 00 is the lowest.
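As a non-limiting illustration, the header fields described above can be extracted from the NALU type octet as sketched below. The bit layout is an assumption: one bit for the error indicator, two bits for the priority field (four values) and the remaining five bits for the type field, most significant bit first. The function names are hypothetical.

```python
def parse_nalu_header(octet):
    """Split a NALU type octet into its three fields (assumed 1 + 2 + 5 bit layout)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("header must be a single octet")
    return {
        "error_indicator": (octet >> 7) & 0x1,  # field 11.1 (F)
        "priority": (octet >> 5) & 0x3,         # field 11.2, values 0..3
        "type": octet & 0x1F,                   # field 11.3
    }


def may_discard(header):
    """Priority 00: content is not used to reconstruct stored reference pictures."""
    return header["priority"] == 0


header = parse_nalu_header(0x65)
print(header, may_discard(header))
```

A network element could apply `may_discard` to decide which NALUs can be dropped under congestion without affecting the reference pictures.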

The payload part 12 of the NALU contains at least a video sequence ID field 12.1, a field indicator 12.2, size field 12.3, timing info 12.4 and the encoded picture information 12.5. The video sequence ID field 12.1 is used for storing the number of the video sequence to which the picture belongs. The field indicator 12.2 is used to signal whether the picture is a first or a second frame when two-frame picture format is used. Both frames may be coded as separate pictures. The first field indicator equal to 1 advantageously signals that the NALU belongs to a coded frame or a coded field that precedes the second coded field of the same frame in decoding order. The first field indicator equal to 0 signals that the NALU belongs to a coded field that succeeds the first coded field of the same frame in decoding order. The timing info field 12.4 is used for transferring time-related information, if necessary.

The NAL units can be delivered in different kinds of packets. In this advantageous embodiment the different packet formats include simple packets and aggregation packets. The aggregation packets can further be divided into single-time aggregation packets and multi-time aggregation packets.

A simple packet according to this invention consists of one NALU. A NAL unit stream composed by decapsulating Simple Packets in RTP sequence number order should conform to the NAL unit delivery order.

Aggregation packets are the packet aggregation scheme of this payload specification. The scheme is introduced to reflect the dramatically different MTU sizes of two different types of networks: wireline IP networks (with an MTU size that is often limited by the Ethernet MTU size, roughly 1500 bytes), and IP or non-IP (e.g. H.324/M) based wireless networks with preferred transmission unit sizes of 254 bytes or less. In order to prevent media transcoding between the two worlds, and to avoid undesirable packetization overhead, a packet aggregation scheme is introduced.

Single-Time Aggregation Packets (STAP) aggregate NALUs with identical NALU-time. Correspondingly, Multi-Time Aggregation Packets (MTAP) aggregate NALUs with potentially differing NALU-time. Two different MTAPs are defined that differ in the length of the NALU timestamp offset. The term NALU-time is defined as the value the RTP timestamp would have if that NALU were transported in its own RTP packet.

MTAPs and STAP share the following non-limiting packetization rules according to an advantageous embodiment of the present invention. The RTP timestamp must be set to the minimum of the NALU times of all the NALUs to be aggregated. The Type field of the NALU type octet must be set to the appropriate value as indicated in table 1. The error indicator field 11.1 must be cleared if all error indicator fields of the aggregated NALUs are zero, otherwise it must be set.

TABLE 1

Type    Packet    Timestamp offset field length (in bits)
0x18    STAP      0
0x19    MTAP16    16
0x20    MTAP24    24
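As a non-limiting illustration, the shared packetization rules can be sketched as follows. The function name `plan_aggregation` and its tuple layout are hypothetical. Choosing the packet kind with the shortest timestamp offset field that still fits is an assumption of this sketch, and RTP timestamp wraparound is ignored.

```python
def plan_aggregation(nalus):
    """Derive aggregation packet header values.

    nalus: list of (nalu_time, error_indicator) pairs for the NALUs to aggregate.
    Returns (packet_kind, rtp_timestamp, error_indicator, timestamp_offsets).
    """
    if not nalus:
        raise ValueError("an aggregation packet carries at least one NALU")
    # RTP timestamp: minimum of the NALU times of all aggregated NALUs.
    rtp_timestamp = min(t for t, _ in nalus)
    # Error indicator: cleared only if every aggregated NALU has it cleared.
    error_indicator = 1 if any(f for _, f in nalus) else 0
    offsets = [t - rtp_timestamp for t, _ in nalus]
    largest = max(offsets)
    if largest == 0:
        kind = "STAP"        # all NALUs share the same NALU-time
    elif largest < (1 << 16):
        kind = "MTAP16"      # assumed choice: smallest offset field that fits
    elif largest < (1 << 24):
        kind = "MTAP24"
    else:
        raise ValueError("timestamp offset does not fit in 24 bits")
    return kind, rtp_timestamp, error_indicator, offsets


print(plan_aggregation([(9000, 0), (6000, 0), (12000, 1)]))
```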

The NALU Payload of an aggregation packet consists of one or more aggregation units. An aggregation packet can carry as many aggregation units as necessary; however, the total amount of data in an aggregation packet must fit into an IP packet, and the size should be chosen such that the resulting IP packet is smaller than the MTU size.

Single-Time Aggregation Packet (STAP) should be used whenever aggregating NALUs that share the same NALU-time. The NALU payload of an STAP consists of the video sequence ID field 12.1 (e.g. 7 bits) and the field indicator 12.2 followed by Single-Picture Aggregation Units (SPAU).

In another alternative embodiment the NALU payload of a Single-Time Aggregation Packet (STAP) consists of a 16-bit unsigned decoding order number (DON) followed by Single-Picture Aggregation Units (SPAU).

A video sequence according to this specification can be any part of a NALU stream that can be decoded independently from other parts of the NALU stream.

A frame consists of two fields that may be coded as separate pictures. The first field indicator equal to 1 signals that the NALU belongs to a coded frame or a coded field that precedes the second coded field of the same frame in decoding order. The first field indicator equal to 0 signals that the NALU belongs to a coded field that succeeds the first coded field of the same frame in decoding order.

A Single-Picture Aggregation Unit consists of e.g. 16-bit unsigned size information that indicates the size of the following NALU in bytes (excluding these two octets, but including the NALU type octet of the NALU), followed by the NALU itself including its NALU type byte.
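As a non-limiting illustration, a run of Single-Picture Aggregation Units can be split back into NALUs as sketched below. The function name is hypothetical, and the sketch assumes that the 16-bit size field is in network byte order (big-endian) and that the input holds only the aggregation units, i.e. any leading fields of the STAP payload have already been removed.

```python
import struct


def split_aggregation_units(data):
    """Return the NALUs carried in consecutive [16-bit size][NALU] units."""
    nalus = []
    pos = 0
    while pos < len(data):
        if pos + 2 > len(data):
            raise ValueError("truncated size field")
        (size,) = struct.unpack_from(">H", data, pos)  # assumed big-endian
        pos += 2                    # the size excludes these two octets
        if pos + size > len(data):
            raise ValueError("truncated NALU")
        nalus.append(data[pos:pos + size])  # includes the NALU type octet
        pos += size
    return nalus


units = b"\x00\x03\x65\xaa\xbb" + b"\x00\x01\x41"
print(split_aggregation_units(units))
```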

A Multi-Time Aggregation Packet (MTAP) has an architecture similar to that of an STAP. It consists of the NALU header byte and one or more Multi-Picture Aggregation Units. The choice between the different MTAP fields is application dependent: the larger the timestamp offset, the higher the flexibility of the MTAP, but also the higher the overhead.

Two different Multi-Time Aggregation Units are defined in this specification. Both of them consist of e.g. 16 bits of unsigned size information of the following NALU (the same as the size information in the STAP). In addition to these 16 bits there are also the video sequence ID field 12.1 (e.g. 7 bits), the field indicator 12.2 and n bits of timing information for this NALU, whereby n can e.g. be 16 or 24. The timing information field has to be set so that the RTP timestamp of an RTP packet of each NALU in the MTAP (the NALU-time) can be generated by adding the timing information to the RTP timestamp of the MTAP.

In another alternative embodiment the Multi-Time Aggregation Packet (MTAP) consists of the NALU header byte, a decoding order number base (DONB) field 12.1 (e.g. 16 bits), and one or more Multi-Picture Aggregation Units. The two different Multi-Time Aggregation Units are in this case defined as follows. Both of them consist of e.g. 16 bits of unsigned size information of the following NALU (the same as the size information in the STAP). In addition to these 16 bits there are also the decoding order number delta (DOND) field 12.5 (e.g. 7 bits), and n bits of timing information for this NALU, whereby n can e.g. be 16 or 24. DON of the following NALU is equal to DONB+DOND. The timing information field has to be set so that the RTP timestamp of an RTP packet of each NALU in the MTAP (the NALU-time) can be generated by adding the timing information to the RTP timestamp of the MTAP. DONB shall contain the smallest value of DON among the NAL units of the MTAP.
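As a non-limiting illustration, a receiver could recover the DON and the NALU-time of one aggregation unit of an MTAP as sketched below. The function name is hypothetical. Reducing the DON modulo 2^16 (the example width of DONB) and the NALU-time modulo 2^32 (the width of the RTP timestamp) is an assumption of this sketch.

```python
DON_BITS = 16            # example width of the DONB field
RTP_TIMESTAMP_BITS = 32  # width of the RTP timestamp


def resolve_mtap_unit(mtap_timestamp, donb, dond, timing_info):
    """Return (DON, NALU-time) for one aggregation unit of an MTAP."""
    don = (donb + dond) % (1 << DON_BITS)  # DON = DONB + DOND (assumed modular)
    nalu_time = (mtap_timestamp + timing_info) % (1 << RTP_TIMESTAMP_BITS)
    return don, nalu_time


print(resolve_mtap_unit(mtap_timestamp=1000, donb=20, dond=3, timing_info=300))
```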

The behaviour of the buffering model according to the present invention is advantageously controlled with the following parameters: the initial input period (e.g. in clock ticks of a 90-kHz clock) and the size of the hypothetical packet input buffer (e.g. in bytes). Preferably, the default initial input period and the default size of the hypothetical packet input buffer are 0. PSS clients may signal their capability of providing a larger buffer in the capability exchange process.

The maximum video bit-rate can be signalled, for example, in the media-level bandwidth attribute of SDP, or in a dedicated SDP parameter. If the video-level bandwidth attribute is not present in the presentation description, the maximum video bit-rate is defined according to the video coding profile and level in use.

Initial parameter values for each stream can be signalled within the SDP description of the stream, for example using the MIME type parameters or similar non-standard SDP parameters. Signalled parameter values override the corresponding default parameter values. The values signalled within the SDP description guarantee pauseless playback from the beginning of the stream until the end of the stream (assuming a constant-delay reliable transmission channel).

PSS servers may update parameter values in the response to an RTSP PLAY request. If an updated parameter value is present, it shall replace the value signalled in the SDP description or the default parameter value in the operation of the PSS buffering model. An updated parameter value is valid only in the indicated playback range, and it has no effect after that. Assuming a constant-delay reliable transmission channel, the updated parameter values guarantee pauseless playback of the actual range indicated in the response to the PLAY request. The indicated size of the hypothetical input packet buffer and initial input period shall be smaller than or equal to the corresponding values in the SDP description or the corresponding default values, whichever ones are valid.

The server buffering verifier is specified according to the specified buffering model. The model is based on a hypothetical packet input buffer.

The buffering model is presented next. The buffer is initially empty. A PSS Server adds each transmitted RTP packet having video payload to the hypothetical packet input buffer 1.1 immediately when it is transmitted. All protocol headers at RTP or any lower layer are removed. Data is not removed from the hypothetical packet input buffer during a period called the initial input period. The initial input period starts when the first RTP packet is added to the hypothetical packet input buffer. When the initial input period has expired, removal of data from the hypothetical packet input buffer is started. Data removal happens advantageously at the maximum video bit-rate, unless the hypothetical packet input buffer 1.1 is empty. Data removed from the hypothetical packet input buffer 1.1 is input to the Hypothetical Reference Decoder 5. The hypothetical reference decoder 5 performs the hypothetical decoding process to ensure that the encoded video stream is decodable according to the set parameters. If the hypothetical reference decoder 5 notices that e.g. the picture buffer 5.2 of the hypothetical reference decoder 5 overflows, the buffer parameters can be modified. In that case the new parameters are also transmitted to the receiving device 8, in which the buffers are re-initialized accordingly.

The encoding and transmitting device 1, such as a PSS server, shall verify that a transmitted RTP packet stream complies with the following requirements:

-   The buffering model shall be used with the default or signalled buffering parameter values. Signalled parameter values override the corresponding default parameter values.
-   The occupancy of the hypothetical packet input buffer shall not exceed the default or signalled buffer size.
-   The output bitstream of the hypothetical packet input buffer shall conform to the definitions of the Hypothetical Reference Decoder.
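As a non-limiting illustration, the occupancy requirement can be checked with the sketch below. The function names and the `(send_tick, payload_bytes)` layout are hypothetical. Times are assumed to be in ticks of a 90-kHz clock and payload sizes are assumed to exclude RTP and lower-layer headers. Conformance of the output bitstream to the Hypothetical Reference Decoder is not checked here.

```python
CLOCK_HZ = 90_000  # assumed clock for send ticks and the initial input period


def peak_occupancy(packets, initial_input_period, max_bitrate):
    """Peak occupancy in bytes of the hypothetical packet input buffer.

    packets: list of (send_tick, payload_bytes) in transmission order.
    initial_input_period: ticks during which no data is removed.
    max_bitrate: maximum video bit-rate in bits per second.
    """
    if not packets:
        return 0.0
    bytes_per_tick = max_bitrate / (8 * CLOCK_HZ)
    # The initial input period starts when the first packet is added.
    drain_start = packets[0][0] + initial_input_period
    occupancy = 0.0
    peak = 0.0
    last_tick = packets[0][0]
    for tick, size in packets:
        # Remove data only for the part of the interval after drain_start;
        # the buffer cannot go below empty.
        drain_ticks = max(0, tick - max(last_tick, drain_start))
        occupancy = max(0.0, occupancy - drain_ticks * bytes_per_tick)
        occupancy += size  # packet is added when it is transmitted
        peak = max(peak, occupancy)
        last_tick = tick
    return peak


def complies(packets, initial_input_period, max_bitrate, buffer_size):
    """True if the occupancy never exceeds the default or signalled buffer size."""
    return peak_occupancy(packets, initial_input_period, max_bitrate) <= buffer_size


stream = [(0, 1000), (500, 1000), (1500, 1000)]
print(peak_occupancy(stream, initial_input_period=1000, max_bitrate=720_000))
```

Because data removal only lowers the occupancy, the peak can only occur immediately after a packet is added, which is why the sketch samples the occupancy at those instants.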

When the buffering model is in use, the PSS client shall be capable of receiving an RTP packet stream that complies with the PSS server buffering verifier, when the RTP packet stream is carried over a constant-delay reliable transmission channel. Furthermore, the decoder of the PSS client shall output frames at the correct rate defined by the RTP time-stamps of the received packet stream.

Transmission

The transmission and/or storing of the encoded pictures (and the optional virtual decoding) can be started immediately after the first encoded picture is ready. This picture is not necessarily the first one in decoder output order because the decoding order and the output order may not be the same.

When the first picture of the video stream is encoded the transmission can be started. The encoded pictures are optionally stored to the encoded picture buffer 1.2. The transmission can also start at a later stage, for example, after a certain part of the video stream is encoded.

The decoder 2 should also output the decoded pictures in correct order, for example by using the ordering of the picture order counts, and hence the reordering process needs to be defined clearly and normatively.

De-Packetizing

The de-packetization process is implementation dependent. Hence, the following description is a non-restrictive example of a suitable implementation. Other schemes may be used as well. Optimizations relative to the described algorithms are likely possible.

The general concept behind these de-packetization rules is to reorder NAL units from transmission order to the NAL unit delivery order.

Decoding

Next, the operation of the receiver 8 will be described. The receiver 8 collects all packets belonging to a picture, bringing them into a reasonable order. The strictness of the order depends on the profile employed. The received packets are stored into the receiving buffer 9.1 (pre-decoding buffer). The receiver 8 discards anything that is unusable, and passes the rest to the decoder 2.

Aggregation packets are handled by unloading their payload into individual RTP packets carrying NALUs. Those NALUs are processed as if they were received in separate RTP packets, in the order they were arranged in the Aggregation Packet.

Hereinafter, let N be the value of the optional num-reorder-VCL-NAL-units MIME type parameter which specifies the maximum number of VCL NAL units that precede any VCL NAL unit in the packet stream in NAL unit delivery order and follow the VCL NAL unit in RTP sequence number order or in the composition order of the aggregation packet containing the VCL NAL unit. If the parameter is not present, a value of 0 can be implied.

When the video stream transfer session is initialized, the receiver 8 allocates memory for the receiving buffer 9.1 for storing at least N VCL NAL units. The receiver then starts to receive the video stream and stores the received VCL NAL units into the receiving buffer, until at least N VCL NAL units are stored into the receiving buffer 9.1.

When the receiver buffer 9.1 contains at least N VCL NAL units, NAL units are removed from the receiver buffer 9.1 one by one and passed to the decoder 2. The NAL units are not necessarily removed from the receiver buffer 9.1 in the same order in which they were stored, but according to the video sequence ID of the NAL units, as described below. The delivery of the packets to the decoder 2 is continued until the buffer contains fewer than N VCL NAL units, i.e. N−1 VCL NAL units.

In FIG. 12 an example of buffering the transmission units in the predecoder buffer of the decoder is depicted. The numbers refer to the decoding order while the order of the transmission units refers to the transmission order (and also to the receiving order).

Hereinafter, let PVSID be the video sequence ID (VSID) of the latest NAL unit passed to the decoder. All NAL units in a STAP share the same VSID. The order that NAL units are passed to the decoder is specified as follows: If the oldest RTP sequence number in the buffer corresponds to a Simple Packet, the NALU in the Simple Packet is the next NALU in the NAL unit delivery order. If the oldest RTP sequence number in the buffer corresponds to an Aggregation Packet, the NAL unit delivery order is recovered among the NALUs conveyed in Aggregation Packets in RTP sequence number order until the next Simple Packet (exclusive). This set of NALUs is hereinafter referred to as the candidate NALUs. If no NALUs conveyed in Simple Packets reside in the buffer, all NALUs belong to candidate NALUs.

For each NAL unit among the candidate NALUs, a VSID distance is calculated as follows. If the VSID of the NAL unit is larger than PVSID, the VSID distance is equal to VSID−PVSID. Otherwise, the VSID distance is equal to 2^(number of bits used to signal VSID)−PVSID+VSID. NAL units are delivered to the decoder in ascending order of VSID distance. If several NAL units share the same VSID distance, the order to pass them to the decoder shall conform to the NAL unit delivery order defined in this specification. The NAL unit delivery order can be recovered as described in the following.
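As a non-limiting illustration, the VSID distance and the resulting ordering of candidate NALUs can be sketched as follows. The function names and the `(vsid, label)` layout are hypothetical, and the default of 7 bits follows the example width of the video sequence ID field. The candidates are assumed to be supplied already in NAL unit delivery order, so that the stable sort keeps that order among NAL units with equal distance. Following the rule as written, a NAL unit whose VSID equals PVSID receives the distance 2^bits.

```python
def vsid_distance(vsid, pvsid, bits=7):
    """VSID distance of a candidate NAL unit relative to PVSID."""
    if vsid > pvsid:
        return vsid - pvsid
    return (1 << bits) - pvsid + vsid  # wrapped-around case


def order_candidates(candidates, pvsid, bits=7):
    """Sort (vsid, label) candidates by ascending VSID distance.

    sorted() is stable, so candidates with the same distance keep their
    input order (assumed to be the NAL unit delivery order).
    """
    return sorted(candidates, key=lambda c: vsid_distance(c[0], pvsid, bits))


candidates = [(1, "c"), (127, "a"), (0, "b"), (127, "a2")]
print(order_candidates(candidates, pvsid=126))
```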

First, slices and data partitions are associated with pictures according to their frame numbers, RTP timestamps and first field flags: all NALUs sharing the same values of the frame number, the RTP timestamp and the first field flag belong to the same picture. SEI NALUs, sequence parameter set NALUs, picture parameter set NALUs, picture delimiter NALUs, end of sequence NALUs, end of stream NALUs, and filler data NALUs belong to the picture of the next VCL NAL unit in transmission order.

Second, the delivery order of the pictures is concluded based on nal_ref_idc, the frame number, the first field flag, and the RTP timestamp of each picture. The delivery order of pictures is in ascending order of frame numbers (in modulo arithmetic). If several pictures share the same value of frame number, the picture(s) that have nal_ref_idc equal to 0 are delivered first. If several pictures share the same value of frame number and they all have nal_ref_idc equal to 0, the pictures are delivered in ascending RTP timestamp order. If two pictures share the same RTP timestamp, the picture having first field flag equal to 1 is delivered first. Note that a primary coded picture and the corresponding redundant coded pictures are herein considered as one coded picture.
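As a non-limiting illustration, the picture delivery order can be expressed as a sort key. The `Picture` type and the function name are hypothetical. The sketch assumes that frame numbers do not wrap around within the set being sorted, and it applies the RTP timestamp and first field flag tie-breaks to all pictures sharing a frame number, which is a simplification of the rules above.

```python
from typing import NamedTuple


class Picture(NamedTuple):
    name: str
    frame_num: int
    nal_ref_idc: int
    rtp_timestamp: int
    first_field_flag: int


def picture_delivery_order(pictures):
    """Sort pictures into delivery order (no frame number wraparound assumed)."""
    def key(p):
        return (
            p.frame_num,                         # ascending frame number
            0 if p.nal_ref_idc == 0 else 1,      # nal_ref_idc == 0 first
            p.rtp_timestamp,                     # ascending RTP timestamp
            0 if p.first_field_flag == 1 else 1, # first field flag == 1 first
        )
    return sorted(pictures, key=key)


pictures = [
    Picture("A", 1, 1, 3000, 1),
    Picture("B", 1, 0, 2000, 0),
    Picture("C", 1, 0, 2000, 1),
    Picture("D", 0, 1, 0, 1),
    Picture("E", 1, 0, 1000, 1),
]
print([p.name for p in picture_delivery_order(pictures)])
```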

Third, if the video decoder in use does not support Arbitrary Slice Ordering, the delivery order of slices and A data partitions is in ascending order of the first_mb_in_slice syntax element in the slice header. Moreover, B and C data partitions immediately follow the corresponding A data partition in delivery order.

In the above the terms PVSID and VSID were used. Terms PDON (the decoding order number of the previous NAL unit of an aggregation packet in NAL unit delivery order) and DON (decoding order number) can be used instead as follows: Let PDON of the first NAL unit passed to the decoder be 0. The order that NAL units are passed to the decoder is specified as follows: If the oldest RTP sequence number in the buffer corresponds to a Simple Packet, the NALU in the Simple Packet is the next NALU in the NAL unit delivery order. If the oldest RTP sequence number in the buffer corresponds to an Aggregation Packet, the NAL unit delivery order is recovered among the NALUs conveyed in Aggregation Packets in RTP sequence number order until the next Simple Packet (exclusive). This set of NALUs is hereinafter referred to as the candidate NALUs. If no NALUs conveyed in Simple Packets reside in the buffer, all NALUs belong to candidate NALUs.

For each NAL unit among the candidate NALUs, a DON distance is calculated as follows. If the DON of the NAL unit is larger than PDON, the DON distance is equal to DON−PDON. Otherwise, the DON distance is equal to 2^(number of bits to represent a DON and PDON as an unsigned integer)−PDON+DON. NAL units are delivered to the decoder in ascending order of DON distance.

If several NAL units share the same DON distance, the order to pass them to the decoder is:

-   1. Picture delimiter NAL unit, if any
-   2. Sequence parameter set NAL units, if any
-   3. Picture parameter set NAL units, if any
-   4. SEI NAL units, if any
-   5. Coded slice and slice data partition NAL units of the primary coded picture, if any
-   6. Coded slice and slice data partition NAL units of the redundant coded pictures, if any
-   7. Filler data NAL units, if any
-   8. End of sequence NAL unit, if any
-   9. End of stream NAL unit, if any.
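As a non-limiting illustration, the tie-breaking order listed above can be combined with the DON distance in a single sort key. The kind names in `RANK` and the `(don_distance, kind)` layout are hypothetical, and the DON distance of each NAL unit is assumed to have been computed beforehand.

```python
# Rank of each NAL unit kind among units sharing the same DON distance.
RANK = {
    "picture_delimiter": 1,
    "sequence_parameter_set": 2,
    "picture_parameter_set": 3,
    "sei": 4,
    "primary_slice": 5,     # coded slice / data partition, primary coded picture
    "redundant_slice": 6,   # coded slice / data partition, redundant coded picture
    "filler_data": 7,
    "end_of_sequence": 8,
    "end_of_stream": 9,
}


def delivery_order(units):
    """Sort (don_distance, kind) pairs by DON distance, then by kind rank."""
    return sorted(units, key=lambda u: (u[0], RANK[u[1]]))


units = [
    (2, "primary_slice"),
    (1, "primary_slice"),
    (1, "sei"),
    (2, "picture_parameter_set"),
    (1, "picture_delimiter"),
]
print(delivery_order(units))
```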

If the video decoder in use does not support Arbitrary Slice Ordering, the delivery order of slices and A data partitions is in ascending order of the first_mb_in_slice syntax element in the slice header. Moreover, B and C data partitions immediately follow the corresponding A data partition in delivery order.

The following additional de-packetization rules may be used to implement an operational JVT de-packetizer: NALUs are presented to the JVT decoder in the order of the RTP sequence number. NALUs carried in an Aggregation Packet are presented in their order in the Aggregation packet. All NALUs of the Aggregation packet are processed before the next RTP packet is processed.

Intelligent RTP receivers (e.g. in Gateways) may identify lost DPAs. If a lost DPA is found, the Gateway MAY decide not to send the DPB and DPC partitions, as their information is meaningless for the JVT Decoder. In this way a network element can reduce network load by discarding useless packets, without parsing a complex bit stream.

Intelligent receivers may discard all packets that have a NAL Reference Idc of 0. However, they should process those packets if possible, because the user experience may suffer if the packets are discarded.

The DPB 2.1 contains memory places for storing a number of pictures. Those places are also called frame stores in the description. The decoder 2 decodes the received pictures in correct order. To do so the decoder examines the video sequence ID information of the received pictures. If the encoder has selected the video sequence ID for each group of pictures freely, the decoder decodes the pictures of the group of pictures in the order in which they are received. If the encoder has defined for each group of pictures the video sequence ID by using an incrementing (or decrementing) numbering scheme, the decoder decodes the group of pictures in the order of video sequence IDs. In other words, the group of pictures having the smallest (or biggest) video sequence ID is decoded first.

The present invention can be applied in many kinds of systems and devices. The transmitting device 6 including the encoder 1 and optionally the HRD 5 advantageously also includes a transmitter 7 to transmit the encoded pictures to the transmission channel 4. The receiving device 8 includes the receiver 9 to receive the encoded pictures, the decoder 2, and a display 10 on which the decoded pictures can be displayed. The transmission channel can be, for example, a landline communication channel and/or a wireless communication channel. The transmitting device and the receiving device also include one or more processors 1.2, 2.2 which can perform the necessary steps for controlling the encoding/decoding process of the video stream according to the invention. Therefore, the method according to the present invention can mainly be implemented as machine executable steps of the processors. The buffering of the pictures can be implemented in the memory 1.3, 2.3 of the devices. The program code 1.4 of the encoder can be stored into the memory 1.3. Respectively, the program code 2.4 of the decoder can be stored into the memory 2.3.

It is obvious that the hypothetical reference decoder 5 can be situated after the encoder 1, so that the hypothetical reference decoder 5 rearranges the encoded pictures, if necessary, and can ensure that the pre-decoding buffer of the receiver 8 does not overflow.

The present invention can be implemented in the buffering verifier which can be part of the hypothetical reference decoder 5 or it can be separate from it.

1. A method for buffering encoded pictures, the method including anencoding step for forming encoded pictures in an encoder, a transmissionstep for transmitting said encoded pictures to a decoder as transmissionunits, a buffering step for buffering transmission units transmitted tothe decoder in a buffer, and a decoding step for decoding the encodedpictures for forming decoded pictures, wherein the buffer size isdefined so that the total size of at least two transmission units isdefined and the maximum buffer size is defined on the basis of the totalsize.
 2. The method according to claim 1, wherein the number oftransmission units used in the calculation of the total size is thefractional number of the necessary buffer size in terms of the number oftransmission units.
 3. The method according to claim 1, wherein thenumber of transmission units used in the calculation of the total sizeis the fractional number of the necessary buffer size in terms of thenumber of transmission units, wherein the fractional number is of theform 1/N N being an integer number.
 4. The method according to claim 1, wherein the number of transmission units used in the calculation of the total size is the same as the necessary buffer size in terms of the number of transmission units.
 5. The method according to claim 1, wherein the number of transmission units used in the calculation of the total size is expressed as in buffering order of the transmission units.
 6. A system comprising an encoder for encoding pictures, a transmitter for transmitting said encoded pictures to a decoder as VCL NAL units, a decoder for decoding the encoded pictures for forming decoded pictures, the decoder comprising a buffer for buffering transmission units transmitted to the decoder, wherein the system further comprises a definer for defining the buffer size so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size.
 7. An encoder for encoding pictures comprising a transmitter for transmitting said encoded pictures to a decoder as transmission units for buffering in a buffer and decoding, wherein the encoder further comprises a definer for defining the buffer size so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size.
 8. The encoder according to claim 7, wherein it comprises a buffer for buffering the encoded pictures, and a hypothetical reference decoder for determining buffering requirements for decoding of the encoded pictures.
 9. A decoder for decoding the encoded pictures for forming decoded pictures, comprising a pre-decoding buffer for buffering received encoded pictures for decoding, wherein the decoder further comprises a processor for allocating memory for the pre-decoding buffer according to a received parameter indicative of the buffer size, and the buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size.
 10. A software program product comprising machine executable steps stored in a readable memory for performing a method for buffering encoded pictures when said steps are executed on a processor, the method including an encoding step for forming encoded pictures in an encoder, a transmission step for transmitting said encoded pictures to a decoder as transmission units, a buffering step for buffering transmission units transmitted to the decoder in a buffer, and a decoding step for decoding the encoded pictures for forming decoded pictures, wherein the buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size.
 11. A storage medium for storing a software program comprising machine executable steps for performing a method for buffering encoded pictures, the method including an encoding step for forming encoded pictures in an encoder, a transmission step for transmitting said encoded pictures to a decoder as transmission units, a buffering step for buffering transmission units transmitted to the decoder in a buffer, and a decoding step for decoding the encoded pictures for forming decoded pictures, wherein the buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size.
 12. An electronic device comprising an encoder for encoding pictures, and a transmitter for transmitting said encoded pictures to a decoder as transmission units for buffering in a buffer and decoding, wherein the electronic device further comprises a definer for defining the buffer size so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size.
 13. A signal including encoded pictures as transmission units, for which buffering requirements for decoding of the encoded pictures is determined, wherein a parameter indicative of the buffer size so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size, and said parameter is attached to the signal.
 14. A transmitting device, which comprises an encoder for encoding pictures comprising a transmitter for transmitting said encoded pictures to a decoder as transmission units for buffering in a buffer and decoding, wherein the transmitting device further comprises a definer for defining the buffer size so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size.
 15. A receiving device, which comprises a decoder for decoding the encoded pictures for forming decoded pictures, comprising a pre-decoding buffer for buffering received encoded pictures for decoding, wherein the decoder further comprises a processor for allocating memory for the pre-decoding buffer according to a received parameter indicative of the buffer size, and the buffer size is defined so that the total size of at least two transmission units is defined and the maximum buffer size is defined on the basis of the total size.